Commit Graph

192 Commits

Author SHA1 Message Date
Peter Zijlstra f98393a64c mm: remove destroy_dirty_buffers from invalidate_bdev()
Remove the destroy_dirty_buffers argument from invalidate_bdev(), it hasn't
been used in 6 years (so akpm says).

find * -name \*.[ch] | xargs grep -l invalidate_bdev |
while read file; do
	quilt add $file;
	sed -ie 's/invalidate_bdev(\([^,]*\),[^)]*)/invalidate_bdev(\1)/g' $file;
done

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:55 -07:00
Andrew Morton 1aa9b4b9bc [PATCH] revert "retries in ext3_prepare_write() violate ordering requirements"
Revert e92a4d595b.

Dmitry points out

"When we block_prepare_write() failed while ext3_prepare_write() we jump to
 "failure" label and call ext3_prepare_failure() witch search last mapped bh
 and invoke commit_write untill it.  This is wrong!!  because some bh from
 begining to the last mapped bh may be not uptodate.  As a result we commit to
 disk not uptodate page content witch contains garbage from previous usage."

and

"Unexpected file size increasing."

   Call trace the same as it was in first issue but result is different.
   For example we have file with i_size is zero.  we want write two blocks ,
   but fs has only one free block.

   ->ext3_prepare_write(...from == 0, to == 2048)
     retry:
     ->block_prepare_write() == -ENOSPC# we failed but allocated one block here.
     ->ext3_prepare_failure()
       ->commit_write( from == 0, to == 1024) # after this i_size becomes 1024 :)
     if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
        goto retry;

   Finally when all retries will be spended ext3_prepare_failure return
   -ENOSPC, but i_size was increased and later block trimm procedures can't
   help here.

We don't appear to have the horsepower to fix these issues, so let's put
things back the way they were for now.

Cc: Kirill Korotaev <dev@openvz.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ken Chen <kenneth.w.chen@intel.com>
Cc: Andrey Savochkin <saw@sw.ru>
Cc: <linux-ext4@vger.kernel.org>
Cc: Dmitriy Monakhov <dmonakhov@openvz.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-02 10:06:08 -07:00
Andrew Morton 105fd108a6 [PATCH] "ext[34]: EA block reference count racing fix" performance fix
A little mistake in 8a2bfdcbfa is making all
transactions synchronous, which reduces ext3 performance to comical levels.

Cc: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-23 11:01:22 -07:00
Mingming Cao 8a2bfdcbfa [PATCH] ext[34]: EA block reference count racing fix
There are race issues around ext[34] xattr block release code.

ext[34]_xattr_release_block() checks the reference count of xattr block
(h_refcount) and frees that xattr block if it is the last one reference it.
 Unlike ext2, the check of this counter is unprotected by any lock.
ext[34]_xattr_release_block() will free the mb_cache entry before freeing
that xattr block.  There is a small window between the check for the re
h_refcount ==1 and the call to mb_cache_entry_free().  During this small
window another inode might find this xattr block from the mbcache and reuse
it, racing a refcount updates.  The xattr block will later be freed by the
first inode without notice other inode is still use it.  Later if that
block is reallocated as a datablock for other file, then more serious
problem might happen.

We need put a lock around places checking the refount as well to avoid
racing issue.  Another place need this kind of protection is in
ext3_xattr_block_set(), where it will modify the xattr block content in-
the-fly if the refcount is 1 (means it's the only inode reference it).

This will also fix another issue: the xattr block may not get freed at all
if no lock is to protect the refcount check at the release time.  It is
possible that the last two inodes could release the shared xattr block at
the same time.  But both of them think they are not the last one so only
decreased the h_refcount without freeing xattr block at all.

We need to call lock_buffer() after ext3_journal_get_write_access() to
avoid deadlock (because the later will call lock_buffer()/unlock_buffer
() as well).

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Cc: Andreas Gruenbacher <agruen@suse.de>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:38 -08:00
Aneesh Kumar K.V e627432c29 [PATCH] ext[234]: update documentation
Signed-off-by: "Aneesh Kumar K.V" <aneesh.kumar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-20 17:10:14 -08:00
Tim Schmielau cd354f1ae7 [PATCH] remove many unneeded #includes of sched.h
After Al Viro (finally) succeeded in removing the sched.h #include in module.h
recently, it makes sense again to remove other superfluous sched.h includes.
There are quite a lot of files which include it but don't actually need
anything defined in there.  Presumably these includes were once needed for
macros that used to live in sched.h, but moved to other header files in the
course of cleaning it up.

To ease the pain, this time I did not fiddle with any header files and only
removed #includes from .c-files, which tend to cause less trouble.

Compile tested against 2.6.20-rc2 and 2.6.20-rc2-mm2 (with offsets) on alpha,
arm, i386, ia64, mips, powerpc, and x86_64 with allnoconfig, defconfig,
allmodconfig, and allyesconfig as well as a few randconfigs on x86_64 and all
configs in arch/arm/configs on arm.  I also checked that no new warnings were
introduced by the patch (actually, some warnings are removed that were emitted
by unnecessarily included header files).

Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14 08:09:54 -08:00
Josef 'Jeff' Sipek ee9b6d61a2 [PATCH] Mark struct super_operations const
This patch is inspired by Arjan's "Patch series to mark struct
file_operations and struct inode_operations const".

Compile tested with gcc & sparse.

Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:47 -08:00
Arjan van de Ven 754661f143 [PATCH] mark struct inode_operations const 1
Many struct inode_operations in the kernel can be "const".  Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data.  In addition it'll catch accidental writes at compile time to
these shared resources.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:46 -08:00
Dmitriy Monakhov 3e4fdaf8ae [PATCH] jbd layer function called instead of fs specific one
jbd function called instead of fs specific one.

Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 11:18:06 -08:00
Eric Sandeen 731b9a5498 [PATCH] remove ext[34]_inc_count and _dec_count
- Naming is confusing, ext3_inc_count manipulates i_nlink not i_count
- handle argument passed in is not used
- ext3 and ext4 already call inc_nlink and dec_nlink directly in other places

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:34 -08:00
Eric Sandeen 2988a7740d [PATCH] return ENOENT from ext3_link when racing with unlink
Return -ENOENT from ext[34]_link if we've raced with unlink and i_nlink is
0.  Doing otherwise has the potential to corrupt the orphan inode list,
because we'd wind up with an inode with a non-zero link count on the list,
and it will never get properly cleaned up & removed from the orphan list
before it is freed.

[akpm@osdl.org: build fix]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:34 -08:00
Hugh Dickins 2e7842b887 [PATCH] fix umask when noACL kernel meets extN tuned for ACLs
Fix insecure default behaviour reported by Tigran Aivazian: if an ext2 or
ext3 or ext4 filesystem is tuned to mount with "acl", but mounted by a
kernel built without ACL support, then umask was ignored when creating
inodes - though root or user has umask 022, touch creates files as 0666,
and mkdir creates directories as 0777.

This appears to have worked right until 2.6.11, when a fix to the default
mode on symlinks (always 0777) assumed VFS applies umask: which it does,
unless the mount is marked for ACLs; but ext[234] set MS_POSIXACL in
s_flags according to s_mount_opt set according to def_mount_opts.

We could revert to the 2.6.10 ext[234]_init_acl (adding an S_ISLNK test);
but other filesystems only set MS_POSIXACL when ACLs are configured.  We
could fix this at another level; but it seems most robust to avoid setting
the s_mount_opt flag in the first place (at the expense of more ifdefs).

Likewise don't set the XATTR_USER flag when built without XATTR support.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: <linux-ext4@vger.kernel.org>
Cc: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:34 -08:00
Eric Sandeen ea9a05a133 [PATCH] ext3: refuse ro to rw remount of fs with orphan inodes
In the rare case where we have skipped orphan inode processing due to a
readonly block device, and the block device subsequently changes back to
read-write, disallow a remount,rw transition of the filesystem when we have an
unprocessed orphan inodes as this would corrupt the list.

Ideally we should process the orphan inode list during the remount, but that's
trickier, and this plugs the hole for now.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: "Stephen C. Tweedie" <sct@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:34 -08:00
David Howells f0d1b0b30d [PATCH] LOG2: Implement a general integer log2 facility in the kernel
This facility provides three entry points:

	ilog2()		Log base 2 of unsigned long
	ilog2_u32()	Log base 2 of u32
	ilog2_u64()	Log base 2 of u64

These facilities can either be used inside functions on dynamic data:

	int do_something(long q)
	{
		...;
		y = ilog2(x)
		...;
	}

Or can be used to statically initialise global variables with constant values:

	unsigned n = ilog2(27);

When performing static initialisation, the compiler will report "error:
initializer element is not constant" if asked to take a log of zero or of
something not reducible to a constant.  They treat negative numbers as
unsigned.

When not dealing with a constant, they fall back to using fls() which permits
them to use arch-specific log calculation instructions - such as BSR on
x86/x86_64 or SCAN on FRV - if available.

[akpm@osdl.org: MMC fix]
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: David Howells <dhowells@redhat.com>
Cc: Wojtek Kaniewski <wojtekka@toxygen.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:51 -08:00
Josef "Jeff" Sipek fe21a69389 [PATCH] ext3: change uses of f_{dentry, vfsmnt} to use f_path
Change all the uses of f_{dentry,vfsmnt} to f_path.{dentry,mnt} in the ext3
filesystem.

Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:41 -08:00
Hugh Dickins 7d1c520bb5 [PATCH] ext3 balloc: fix _with_rsv freeze
Port fix to the off-by-one in find_next_usable_block's memscan from ext2 to
ext3; but it didn't cause a serious problem for ext3 because the additional
ext3_test_allocatable check rescued it from the error.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:48 -08:00
Hugh Dickins 2823b5535e [PATCH] ext3 balloc: use io_error label
ext3_new_blocks has a nice io_error label for setting -EIO, so goto that in
the one place that doesn't already use it.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:48 -08:00
Hugh Dickins c56d2561f7 [PATCH] ext3 balloc: say rb_entry not list_entry
The reservations tree is an rb_tree not a list, so it's less confusing to use
rb_entry() than list_entry() - though they're both just container_of().

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:48 -08:00
Hugh Dickins ff50dc562b [PATCH] ext3 balloc: fix off-by-one against rsv_end
rsv_end is the last block within the reservation, so alloc_new_reservation
should accept start_block == rsv_end as success.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:48 -08:00
Hugh Dickins 1650242324 [PATCH] ext3 balloc: fix off-by-one against grp_goal
grp_goal 0 is a genuine goal (unlike -1), so ext3_try_to_allocate_with_rsv
should treat it as such.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:48 -08:00
Hugh Dickins ef5036782e [PATCH] ext3 balloc: reset windowsz when full
ext3_new_blocks should reset the reservation window size to 0 when squeezing
the last blocks out of an almost full filesystem, so the retry doesn't skip
any groups with less than half that free, reporting ENOSPC too soon.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:48 -08:00
Eric Sandeen a8f48a9561 [PATCH] ext3/4: don't do orphan processing on readonly devices
If you do something like:

  # touch foo
  # tail -f foo &
  # rm foo
  # <take snapshot>
  # <mount snapshot>

you'll panic, because ext3/4 tries to do orphan list processing on the
readonly snapshot device, and:

  kernel: journal commit I/O error
  kernel: Assertion failure in journal_flush_Rsmp_e2f189ce() at journal.c:1356: "!journal->j_checkpoint_transactions"
  kernel: Kernel panic: Fatal exception

for a truly readonly underlying device, it's reasonable and necessary
to just skip orphan list processing.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:44 -08:00
Mingming Cao 2bd94bd79e [PATCH] ext3: fix reservation extension
Hugh Dickins wrote:
> Not found anything relevant, but I keep noticing these lines
> in ext2_try_to_allocate_with_rsv(), ext3 and ext4 similar:
>
> 		} else if (grp_goal > 0 &&
> 				(my_rsv->rsv_end - grp_goal + 1) < *count)
> 			try_to_extend_reservation(my_rsv, sb,
> 					*count-my_rsv->rsv_end + grp_goal - 1);
>
> They're wrong, a no-op in most groups, aren't they?  rsv_end is an
> absolute block number, whereas grp_goal is group-relative, so the
> calculation ought to bring in group_first_block?  Or I'm confused.
>

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Cc: "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:39 -08:00
Andrey Savochkin e92a4d595b [PATCH] retries in ext3_prepare_write() violate ordering requirements
In journal=ordered or journal=data mode retry in ext3_prepare_write()
breaks the requirements of journaling of data with respect to metadata.
The fix is to call commit_write to commit allocated zero blocks before
retry.

Signed-off-by: Kirill Korotaev <dev@openvz.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ken Chen <kenneth.w.chen@intel.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:37 -08:00
Andrew Morton 3a229b39eb [PATCH] ext3: uninline large functions
Saves nearly 4kbytes on x86.

Cc: Arnaldo Carvalho de Melo <acme@mandriva.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:35 -08:00
Eric Sandeen 40b851348f [PATCH] handle ext3 directory corruption better
I've been using Steve Grubb's purely evil "fsfuzzer" tool, at
http://people.redhat.com/sgrubb/files/fsfuzzer-0.4.tar.gz

Basically it makes a filesystem, splats some random bits over it, then
tries to mount it and do some simple filesystem actions.

At best, the filesystem catches the corruption gracefully.  At worst,
things spin out of control.

As you might guess, we found a couple places in ext3 where things spin out
of control :)

First, we had a corrupted directory that was never checked for
consistency...  it was corrupt, and pointed to another bad "entry" of
length 0.  The for() loop looped forever, since the length of
ext3_next_entry(de) was 0, and we kept looking at the same pointer over and
over and over and over...  I modeled this check and subsequent action on
what is done for other directory types in ext3_readdir...

(adding this check adds some computational expense; I am testing a followup
patch to reduce the number of times we check and re-check these directory
entries, in all cases.  Thanks for the idea, Andreas).

Next we had a root directory inode which had a corrupted size, claimed to
be > 200M on a 4M filesystem.  There was only really 1 block in the
directory, but because the size was so large, readdir kept coming back for
more, spewing thousands of printk's along the way.

Per Andreas' suggestion, if we're in this read error condition and we're
trying to read an offset which is greater than i_blocks worth of bytes,
stop trying, and break out of the loop.

With these two changes fsfuzz test survives quite well on ext3.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:33 -08:00
Andreas Gruenbacher ed2908f313 [PATCH] Remove superfluous lock_super() in extN xattr code
lock_super() is unnecessary for setting super-block feature flags.  Use the
provided *_SET_COMPAT_FEATURE() macros as well.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:32 -08:00
Pekka Enberg 50ee0a32b1 [PATCH] ext3: fsid for statvfs
Update ext3_statfs to return an FSID that is a 64 bit XOR of the 128 bit
filesystem UUID as suggested by Andreas Dilger.  See the following Bugzilla
entry for details:

  http://bugzilla.kernel.org/show_bug.cgi?id=136

Cc: Andreas Dilger <adilger@clusterfs.com>
Cc: Stephen Tweedie <sct@redhat.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:31 -08:00
Christoph Lameter e18b890bb0 [PATCH] slab: remove kmem_cache_t
Replace all uses of kmem_cache_t with struct kmem_cache.

The patch was generated using the following script:

	#!/bin/sh
	#
	# Replace one string by another in all the kernel sources.
	#

	set -e

	for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
		quilt add $file
		sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
		mv /tmp/$$ $file
		quilt refresh
	done

The script was run like this

	sh replace kmem_cache_t "struct kmem_cache"

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:25 -08:00
Christoph Lameter e6b4f8da3a [PATCH] slab: remove SLAB_NOFS
SLAB_NOFS is an alias of GFP_NOFS.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:23 -08:00
Dmitry Mishin 2245d7c21f [PATCH] ext3: errors behaviour fix
Current error behaviour for ext2 and ext3 filesystems does not fully
correspond to the documentation and should be fixed.

According to man 8 mount, ext2 and ext3 file systems allow to set one of 3
different on-errors behaviours:

  ---- start of quote man 8 mount ----

  errors=continue / errors=remount-ro / errors=panic

    Define the behaviour when an error is encountered.  (Either ignore
    errors and just mark the file system erroneous and continue, or remount
    the file system read-only, or panic and halt the system.) The default is
    set in the filesystem superblock, and can be changed using tune2fs(8).

  ---- end of quote ----

However EXT3_ERRORS_CONTINUE is not read from the superblock, and thus
ERRORS_CONT is not saved on the sbi->s_mount_opt.  It leads to the incorrect
handle of errors on ext3.

Then we've checked corresponding code in ext2 and discovered that it is buggy
as well:

- EXT2_ERRORS_CONTINUE is not read from the superblock (the same);

- parse_option() does not clean the alternative values and thus something
  like (ERRORS_CONT|ERRORS_RO) can be set;

- if options are omitted, parse_option() does not set any of these options.

Therefore it is possible to set any combination of these options on the ext2:

- none of them may be set: EXT2_ERRORS_CONTINUE on superblock / empty mount
  options;

- any of them may be set using mount options;

- 2 any options may be set: by using EXT2_ERRORS_RO/EXT2_ERRORS_PANIC on the
  superblock and other value in mount options;

- and finally all three options may be set by adding third option in remount.

Currently ext2 uses these values only in ext2_error() and it is not leading to
any noticeable troubles.  However somebody may be discouraged when he will try
to workaround EXT2_ERRORS_PANIC on the superblock by using errors=continue in
mount options.

This patch:

EXT3_ERRORS_CONTINUE should be taken from the superblock as default value for
error behaviour.

Signed-off-by:	Dmitry Mishin <dim@openvz.org>
Acked-by:	Vasily Averin <vvs@sw.ru>
Acked-by: 	Kirill Korotaev <dev@openvz.org>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:21 -07:00
Dave Hansen ce71ec3684 [PATCH] r/o bind mounts: monitor zeroing of i_nlink
Some filesystems, instead of simply decrementing i_nlink, simply zero it
during an unlink operation.  We need to catch these in addition to the
decrement operations.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:30 -07:00
Dave Hansen d8c76e6f45 [PATCH] r/o bind mount prepwork: inc_nlink() helper
This is mostly included for parity with dec_nlink(), where we will have some
more hooks.  This one should stay pretty darn straightforward for now.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:30 -07:00
Dave Hansen 9a53c3a783 [PATCH] r/o bind mounts: unlink: monitor i_nlink
When a filesystem decrements i_nlink to zero, it means that a write must be
performed in order to drop the inode from the filesystem.

We're shortly going to have keep filesystems from being remounted r/o between
the time that this i_nlink decrement and that write occurs.

So, add a little helper function to do the decrements.  We'll tie into it in a
bit to note when i_nlink hits zero.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:30 -07:00
Badari Pulavarty ee0b3e671b [PATCH] Remove readv/writev methods and use aio_read/aio_write instead
This patch removes readv() and writev() methods and replaces them with
aio_read()/aio_write() methods.

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:28 -07:00
Badari Pulavarty 027445c372 [PATCH] Vectorize aio_read/aio_write fileop methods
This patch vectorizes aio_read() and aio_write() methods to prepare for
collapsing all aio & vectored operations into one interface - which is
aio_read()/aio_write().

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Michael Holzheu <HOLZHEU@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:28 -07:00
David Howells 52a700c567 [PATCH] BLOCK: Move the Ext3 device ioctl compat stuff to the Ext3 driver [try #6]
Move the Ext3 device ioctl compat stuff from fs/compat_ioctl.c to the Ext3
driver so that the Ext3 header file doesn't need to be included.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:52:29 +02:00
Jens Axboe caa38fb0f4 [PATCH] ext3: make meta data reads use READ_META
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-09-30 20:29:42 +02:00
Theodore Ts'o ba52de123d [PATCH] inode-diet: Eliminate i_blksize from the inode structure
This eliminates the i_blksize field from struct inode.  Filesystems that want
to provide a per-inode st_blksize can do so by providing their own getattr
routine instead of using the generic_fillattr() function.

Note that some filesystems were providing pretty much random (and incorrect)
values for i_blksize.

[bunk@stusta.de: cleanup]
[akpm@osdl.org: generic_fillattr() fix]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:18 -07:00
Alexey Dobriyan 1a1d92c10d [PATCH] Really ignore kmem_cache_destroy return value
* Rougly half of callers already do it by not checking return value
* Code in drivers/acpi/osl.c does the following to be sure:

	(void)kmem_cache_destroy(cache);

* Those who check it printk something, however, slab_error already printed
  the name of failed cache.
* XFS BUGs on failed kmem_cache_destroy which is not the decision
  low-level filesystem driver should make. Converted to ignore.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:10 -07:00
Panagiotis Issaris f52720ca5f [PATCH] fs: Removing useless casts
* Removing useless casts
* Removing useless wrapper
* Conversion from kmalloc+memset to kzalloc

Signed-off-by: Panagiotis Issaris <takis@issaris.org>
Acked-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:10 -07:00
Panagiotis Issaris f8314dc60c [PATCH] fs: Conversions from kmalloc+memset to k(z|c)alloc
Conversions from kmalloc+memset to kzalloc.

Signed-off-by: Panagiotis Issaris <takis@issaris.org>
Jffs2-bit-acked-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:10 -07:00
Eric Sandeen 32c2d2bc4b [PATCH] more ext3 16T overflow fixes
Some of the changes in balloc.c are just cosmetic, as Andreas pointed out -
if they overflow they'll then underflow and things are fine.

5th hunk actually fixes an overflow problem.

Also check for potential overflows in inode & block counts when resizing.

Signed-off-by: Eric Sandeen <esandeen@redhat.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:10 -07:00
Dave Kleikamp a4e4de36dc [PATCH] ext3: Fix sparse warnings
Fixing up some endian-ness warnings in preparation to clone ext4 from ext3.

Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:10 -07:00
Dave Kleikamp e9ad5620bf [PATCH] ext3: More whitespace cleanups
More white space cleanups in preparation of cloning ext4 from ext3.
Removing spaces that precede a tab.

Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:10 -07:00
Vasily Averin 7543fc7b3a [PATCH] ext3: wrong error behavior
SWsoft Virtuozzo/OpenVZ Linux kernel team has discovered that ext3 error
behavior was broken in linux kernels since 2.5.x versions by the following
patch:

2002/10/31 02:15:26-05:00 tytso@snap.thunk.org
Default mount options from superblock for ext2/3 filesystems
http://linux.bkbits.net:8080/linux-2.6/gnupatch@3dc0d88eKbV9ivV4ptRNM8fBuA3JBQ

In case ext3 file system is mounted with errors=continue
(EXT3_ERRORS_CONTINUE) errors should be ignored when possible.  However at
present in case of any error kernel aborts journal and remounts filesystem
to read-only.  Such behavior was hit number of times and noted to differ
from that of 2.4.x kernels.

This patch fixes this:
- do nothing in case of EXT3_ERRORS_CONTINUE,
- set EXT3_MOUNT_ABORT and call journal_abort() in all other cases
- panic() should be called after ext3_commit_super() to save
 sb marked as EXT3_ERROR_FS

Signed-off-by: Vasily Averin <vvs@sw.ru>
Acked-by: Kirill Korotaev <dev@sw.ru>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Stephen C. Tweedie" <sct@redhat.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:09 -07:00
Mingming Cao 36faadc144 [PATCH] ext3: more comments about block allocation/reservation code
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:09 -07:00
Mingming Cao 321fb9e818 [PATCH] ext3: turn on reservation dump on block allocation errors
In the past there were a few kernel panics related to block reservation
tree operations failure (insert/remove etc).  It would be very useful to
get the block allocation reservation map info when such error happens.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:09 -07:00
Eric Sandeen eee194e76c [PATCH] ext3: inode numbers are unsigned long
This is primarily format string fixes, with changes to ialloc.c where large
inode counts could overflow, and also pass around journal_inum as an
unsigned long, just to be pedantic about it....

Signed-off-by: Eric Sandeen <esandeen@redhat.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:09 -07:00
Eric Sandeen 855565e81a [PATCH] fix ext3 mounts at 16T
I need to do some actual IO testing now, but this gets things mounting for
a 16T ext3 filesystem.  (patched up e2fsprogs is needed too, I'll send that
off the kernel list)

This patch fixes these issues in the kernel:

o sbi->s_groups_count overflows in ext3_fill_super()

	sbi->s_groups_count = (le32_to_cpu(es->s_blocks_count) -
			       le32_to_cpu(es->s_first_data_block) +
			       EXT3_BLOCKS_PER_GROUP(sb) - 1) /
			      EXT3_BLOCKS_PER_GROUP(sb);

  at 16T, s_blocks_count is already maxed out; adding
  EXT3_BLOCKS_PER_GROUP(sb) overflows it and groups_count comes out to 0.
  Not really what we want, and causes a failed mount.

  Feel free to check my math (actually, please do!), but changing it this
  way should work & avoid the overflow:

  (A + B - 1)/B changed to: ((A - 1)/B) + 1

o ext3_check_descriptors() overflows range checks

  ext3_check_descriptors() iterates over all block groups making sure
  that various bits are within the right block ranges...  on the last pass
  through, it is checking the error case

   [item] >= block + EXT3_BLOCKS_PER_GROUP(sb)

  where "block" is the first block in the last block group.  The last
  block in this group (and the last one that will fit in 32 bits) is block
  + EXT3_BLOCKS_PER_GROUP(sb)- 1.  block + EXT3_BLOCKS_PER_GROUP(sb) wraps
  back around to 0.

  so, make things clearer with "first_block" and "last_block" where those
  are first and last, inclusive, and use <, > rather than <, >=.

  Finally, the last block group may be smaller than the rest, so account
  for this on the last pass through: last_block = sb->s_blocks_count - 1;

(a similar patch could be done for ext2; does anyone in their right mind
use ext2 at 16T?  I'll send an ext2 patch doing the same thing if that's
warranted)

Signed-off-by: Eric Sandeen <esandeen@redhat.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:09 -07:00
Mingming Cao ae6ddcc5f2 [PATCH] ext3 and jbd cleanup: remove whitespace
Remove whitespace from ext3 and jbd, before we clone ext4.

Signed-off-by: Mingming Cao<cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:09 -07:00
Suparna Bhattacharya 20acaa18d0 [PATCH] ext3 sequential read regression fix
ext3-get-blocks support caused ~20% degrade in Sequential read
performance (tiobench). Problem is with marking the buffer boundary
so IO can be submitted right away. Here is the patch to fix it.

  2.6.18-rc6:
  -----------
  # ./iotest
  1048576+0 records in
  1048576+0 records out
  4294967296 bytes (4.3 GB) copied, 75.2726 seconds, 57.1 MB/s

  real    1m15.285s
  user    0m0.276s
  sys     0m3.884s

  2.6.18-rc6 + fix:
  -----------------
  [root@elm3a241 ~]# ./iotest
  1048576+0 records in
  1048576+0 records out
  4294967296 bytes (4.3 GB) copied, 62.9356 seconds, 68.2 MB/s

The boundary block check in ext3_get_blocks_handle needs to be adjusted
against the count of blocks mapped in this call, now that it can map
more than one block.

Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com>
Tested-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-16 12:54:32 -07:00
NeilBrown fdb36673a9 [PATCH] knfsd: Make ext3 reject filehandles referring to invalid inode number
Inodes earlier than the 'first' inode (e.g.  journal, resize) should be
rejected early - except the root inode.  Also inode numbers that are too
big should be rejected early.

[akpm@osdl.org: cleanup]
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-16 12:54:31 -07:00
Badari Pulavarty 3665d0e58f [PATCH] ext3_getblk() should handle HOLE correctly
It has been reported that ext3_getblk() is not doing the right thing and
triggering following WARN():

BUG: warning at fs/ext3/inode.c:1016/ext3_getblk()
 <c01c5140> ext3_getblk+0x98/0x2a6  <c03b2806> md_wakeup_thread+0x26/0x2a
 <c01c536d> ext3_bread+0x1f/0x88  <c01cedf9> ext3_quota_read+0x136/0x1ae
 <c018b683> v1_read_dqblk+0x61/0xac  <c0188f32> dquot_acquire+0xf6/0x107
 <c01ceaba> ext3_acquire_dquot+0x46/0x68  <c01897d4> dqget+0x155/0x1e7
 <c018a97b> dquot_transfer+0x3e0/0x3e9  <c016fe52> dput+0x23/0x13e
 <c01c7986> ext3_setattr+0xc3/0x240  <c0120f66> current_fs_time+0x52/0x6a
 <c017320e> notify_change+0x2bd/0x30d  <c0159246> chown_common+0x9c/0xc5
 <c02a222c> strncpy_from_user+0x3b/0x68  <c0167fe6> do_path_lookup+0xdf/0x266
 <c016841b> __user_walk_fd+0x44/0x5a  <c01592b9> sys_chown+0x4a/0x55
 <c015a43c> vfs_write+0xe7/0x13c  <c01695d4> sys_mkdir+0x1f/0x23
 <c0102a97> syscall_call+0x7/0xb

Looking at the code, it looks like it's not handle HOLE correctly.  It ends
up returning -EIO.  Here is the patch to fix it.

If we really want to be paranoid, we can allow return values 0 (HOLE), 1
(we asked for one block) and return -EIO for more than 1 block.  But I
really don't see a reason for doing it - all we need is the block# here.
(doesn't matter how many blocks are mapped).

ext3_get_blocks_handle() returns number of blocks it mapped.  It returns 0
in case of HOLE.  ext3_getblk() should handle HOLE properly (currently its
dumping warning stack and returning -EIO).

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Acked-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-08 10:22:50 -07:00
Mingming Cao 08fb306fe6 [PATCH] ext3 filesystem bogus ENOSPC with reservation fix
To handle the earlier bogus ENOSPC error caused by filesystem full of block
reservation, current code falls back to non block reservation, starts to
allocate block(s) from the goal allocation block group as if there is no
block reservation.

Current code needs to re-load the corresponding block group descriptor for
the initial goal block group in this case.  The patch fixes this.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-27 11:01:30 -07:00
Badari Pulavarty 0e31f51d81 [PATCH] ext3 -nobh option causes oops
For files other than IFREG, nobh option doesn't make sense.  Modifications
to them are journalled and needs buffer heads to do that.  Without this
patch, we get kernel oops in page_buffers().

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:44 -07:00
Neil Brown 2ccb48ebb4 [PATCH] ext3: avoid triggering ext3_error on bad NFS file handle
The inode number out of an NFS file handle gets passed eventually to
ext3_get_inode_block() without any checking.  If ext3_get_inode_block()
allows it to trigger an error, then bad filehandles can have unpleasant
effect - ext3_error() will usually cause a forced read-only remount, or a
panic if `errors=panic' was used.

So remove the call to ext3_error there and put a matching check in
ext3/namei.c where inode numbers are read off storage.

[akpm@osdl.org: fix off-by-one error]
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: <stable@kernel.org>
Cc: "Stephen C. Tweedie" <sct@redhat.com>
Cc: Eric Sandeen <esandeen@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:36 -07:00
Andreas Gruenbacher 36cf96f5e7 [PATCH] Remove leftover ext3 acl declarations
These functions no longer exist; remove their declarations.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:26 -07:00
Arjan van de Ven 5c81a4197d [PATCH] lockdep: annotate the quota code
The quota code plays interesting games with the lock ordering; to quote Jan:

| i_mutex of inode containing quota file is acquired after all other
| quota locks. i_mutex of all other inodes is acquired before quota
| locks. Quota code makes sure (by resetting inode operations and
| setting special flag on inode) that noone tries to enter quota code
| while holding i_mutex on a quota file...

The good news is that all of this special case i_mutex grabbing happens in the
(per filesystem) low level quota write function.  For this special case we
need a new I_MUTEX_* nesting level, since this just entirely outside any of
the regular VFS locking rules for i_mutex.  I trust Jan on his blue eyes that
this is not ever going to deadlock; and based on that the patch below is what
it takes to inform lockdep of these very interesting new locking rules.

The new locking rule for the I_MUTEX_QUOTA nesting level is that this is the
deepest possible level of nesting for i_mutex, and that this only should be
used in quota write (and possibly read) function of filesystems.  This makes
the lock ordering of the I_MUTEX_* levels:

I_MUTEX_PARENT -> I_MUTEX_CHILD -> I_MUTEX_NORMAL -> I_MUTEX_QUOTA

Has no effect on non-lockdep kernels.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Jan Kara <jack@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:08 -07:00
Jörn Engel 6ab3d5624e Remove obsolete #include <linux/config.h>
Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30 19:25:36 +02:00
Christoph Hellwig f5e54d6e53 [PATCH] mark address_space_operations const
Same as with already do with the file operations: keep them in .rodata and
prevents people from doing runtime patching.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-28 14:59:04 -07:00
Badari Pulavarty ade1a29e16 [PATCH] ext3: Add "-o bh" option
This patch adds "-o bh" option to force use of buffer_heads.  This option
is needed when we make "nobh" as default - and if we run into problems.

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:20 -07:00
Johann Lombardi 92eeccd8ba [PATCH] ext3: cleanup dead code in ext3_add_entry()
The variables nlen and rlen are defined/initialized but not used in
ext3_add_entry().

Signed-off-by: Johann Lombardi <johann.lombardi@bull.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:15 -07:00
Mingming Cao 43d23f9039 [PATCH] ext3_fsblk_t: the rest of in-kernel filesystem blocks conversion
Convert the ext3 in-kernel filesystem blocks to ext3_fsblk_t.  Convert the
rest of all unsigned long type in-kernel filesystem blocks to ext3_fsblk_t,
and replace the printk format string respondingly.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:10 -07:00
Mingming Cao 1c2bf374a4 [PATCH] ext3_fsblk_t: filesystem, group blocks and bug fixes
Some of the in-kernel ext3 block variable type are treated as signed 4 bytes
int type, thus limited ext3 filesystem to 8TB (4kblock size based).  While
trying to fix them, it seems quite confusing in the ext3 code where some
blocks are filesystem-wide blocks, some are group relative offsets that need
to be signed value (as -1 has special meaning).  So it seem saner to define
two types of physical blocks: one is filesystem wide blocks, another is
group-relative blocks.  The following patches clarify these two types of
blocks in the ext3 code, and fix the type bugs which limit current 32 bit ext3
filesystem limit to 8TB.

With this series of patches and the percpu counter data type changes in the mm
tree, we are able to extend exts filesystem limit to 16TB.

This work is also a pre-request for the recent >32 bit ext3 work, and makes
the kernel to able to address 48 bit ext3 block a lot easier: Simply redefine
ext3_fsblk_t from unsigned long to sector_t and redefine the format string for
ext3 filesystem block corresponding.

Two RFC with a series patches have been posted to ext2-devel list and have
been reviewed and discussed:
http://marc.theaimsgroup.com/?l=ext2-devel&m=114722190816690&w=2

http://marc.theaimsgroup.com/?l=ext2-devel&m=114784919525942&w=2

Patches are tested on both 32 bit machine and 64 bit machine, <8TB ext3 and
>8TB ext3 filesystem(with the latest to be released e2fsprogs-1.39).  Tests
includes overnight fsx, tiobench, dbench and fsstress.

This patch:

Defines ext3_fsblk_t and ext3_grpblk_t, and the printk format string for
filesystem wide blocks.

This patch classifies all block group relative blocks, and ext3_fsblk_t blocks
occurs in the same function where used to be confusing before.  Also include
kernel bug fixes for filesystem wide in-kernel block variables.  There are
some fileystem wide blocks are treated as int/unsigned int type in the kernel
currently, especially in ext3 block allocation and reservation code.  This
patch fixed those bugs by converting those variables to ext3_fsblk_t(unsigned
long) type.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:10 -07:00
Theodore Ts'o d2e5b13c4a [PATCH] ext3: remove inconsistent space before exclamation point in mount code
This was reported as Debian bug #336604.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:07 -07:00
Mingming Cao fcd5df3588 [PATCH] Avoid disk sector_t overflow for >2TB ext3 filesystem
If ext3 filesystem is larger than 2TB, and sector_t is a u32 (i.e.
CONFIG_LBD not defined in the kernel), the calculation of the disk sector
will overflow.  Add check at ext3_fill_super() and ext3_group_extend() to
prevent mount/remount/resize >2TB ext3 filesystem if sector_t size is 4
bytes.

Verified this patch on a 32 bit platform without CONFIG_LBD defined
(sector_t is 32 bits long), mount refuse to mount a 10TB ext3.

Signed-off-by: Mingming Cao<cmm@us.ibm.com>
Acked-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:07 -07:00
Mingming Cao 0216bfcffe [PATCH] percpu counter data type changes to suppport more than 2**31 ext3 free blocks counter
The percpu counter data type are changed in this set of patches to support
more users like ext3 who need more than 32 bit to store the free blocks
total in the filesystem.

- Generic perpcu counters data type changes.  The size of the global counter
  and local counter were explictly specified using s64 and s32.  The global
  counter is changed from long to s64, while the local counter is changed from
  long to s32, so we could avoid doing 64 bit update in most cases.

- Users of the percpu counters are updated to make use of the new
  percpu_counter_init() routine now taking an additional parameter to allow
  users to pass the initial value of the global counter.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:06 -07:00
Andrew Morton e6022603b9 [PATCH] ext3_clear_inode(): avoid kfree(NULL)
Steven Rostedt <rostedt@goodmis.org> points out that `rsv' here is usually
NULL, so we should avoid calling kfree().

Also, fix up some nearby whitespace damage.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:05 -07:00
David Howells 726c334223 [PATCH] VFS: Permit filesystem to perform statfs with a known root dentry
Give the statfs superblock operation a dentry pointer rather than a superblock
pointer.

This complements the get_sb() patch.  That reduced the significance of
sb->s_root, allowing NFS to place a fake root there.  However, NFS does
require a dentry to use as a target for the statfs operation.  This permits
the root in the vfsmount to be used instead.

linux/mount.h has been added where necessary to make allyesconfig build
successfully.

Interest has also been expressed for use with the FUSE and XFS filesystems.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:45 -07:00
David Howells 454e2398be [PATCH] VFS: Permit filesystem to override root dentry on mount
Extend the get_sb() filesystem operation to take an extra argument that
permits the VFS to pass in the target vfsmount that defines the mountpoint.

The filesystem is then required to manually set the superblock and root dentry
pointers.  For most filesystems, this should be done with simple_set_mnt()
which will set the superblock pointer and then set the root dentry to the
superblock's s_root (as per the old default behaviour).

The get_sb() op now returns an integer as there's now no need to return the
superblock pointer.

This patch permits a superblock to be implicitly shared amongst several mount
points, such as can be done with NFS to avoid potential inode aliasing.  In
such a case, simple_set_mnt() would not be called, and instead the mnt_root
and mnt_sb would be set directly.

The patch also makes the following changes:

 (*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
     pointer argument and return an integer, so most filesystems have to change
     very little.

 (*) If one of the convenience function is not used, then get_sb() should
     normally call simple_set_mnt() to instantiate the vfsmount. This will
     always return 0, and so can be tail-called from get_sb().

 (*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
     dcache upon superblock destruction rather than shrink_dcache_anon().

     This is required because the superblock may now have multiple trees that
     aren't actually bound to s_root, but that still need to be cleaned up. The
     currently called functions assume that the whole tree is rooted at s_root,
     and that anonymous dentries are not the roots of trees which results in
     dentries being left unculled.

     However, with the way NFS superblock sharing are currently set to be
     implemented, these assumptions are violated: the root of the filesystem is
     simply a dummy dentry and inode (the real inode for '/' may well be
     inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
     with child trees.

     [*] Anonymous until discovered from another tree.

 (*) The documentation has been adjusted, including the additional bit of
     changing ext2_* into foo_* in the documentation.

[akpm@osdl.org: convert ipath_fs, do other stuff]
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:45 -07:00
Linus Torvalds 2edc322d42 Merge git://git.infradead.org/~dwmw2/rbtree-2.6
* git://git.infradead.org/~dwmw2/rbtree-2.6:
  [RBTREE] Switch rb_colour() et al to en_US spelling of 'color' for consistency
  Update UML kernel/physmem.c to use rb_parent() accessor macro
  [RBTREE] Update hrtimers to use rb_parent() accessor macro.
  [RBTREE] Add explicit alignment to sizeof(long) for struct rb_node.
  [RBTREE] Merge colour and parent fields of struct rb_node.
  [RBTREE] Remove dead code in rb_erase()
  [RBTREE] Update JFFS2 to use rb_parent() accessor macro.
  [RBTREE] Update eventpoll.c to use rb_parent() accessor macro.
  [RBTREE] Update key.c to use rb_parent() accessor macro.
  [RBTREE] Update ext3 to use rb_parent() accessor macro.
  [RBTREE] Change rbtree off-tree marking in I/O schedulers.
  [RBTREE] Add accessor macros for colour and parent fields of rb_node
2006-06-20 14:51:22 -07:00
Andrew Morton 6855a3a6c3 [PATCH] ext3 resize: fix double unlock_super()
From: Andrew Morton <akpm@osdl.org>

Spotted by Jan Capek <jca@sysgo.com>

Cc: "Stephen C. Tweedie" <sct@redhat.com>
Cc: Andreas Dilger <adilger@clusterfs.com>
Cc: Jan Capek <jca@sysgo.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-31 16:27:10 -07:00
Mingming Cao 5dea5176e5 [PATCH] ext3: multile block allocate little endian fixes
Some places in ext3 multiple block allocation code (in 2.6.17-rc3) don't
handle the little endian well.  This was resulting in *wrong* block numbers
being assigned to in-memory block variables and then stored on disk
eventually.  The following patch has been verified to fix an ext3
filesystem failure when run ltp test on a 64 bit machine.

Signed-off-by; Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-03 20:05:41 -07:00
Al Viro a090d9132c [PATCH] protect ext3 ioctl modifying append_only, immutable, etc. with i_mutex
All modifications of ->i_flags in inodes that might be visible to
somebody else must be under ->i_mutex.  That patch fixes ext3 ioctl()
setting S_APPEND and friends.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-26 07:52:21 -07:00
Al Viro de0bb97aff [PATCH] forgotten ->b_data in memcpy() call in ext3/resize.c (oopsable)
sbi->s_group_desc is an array of pointers to buffer_head.  memcpy() of
buffer size from address of buffer_head is a bad idea - it will generate
junk in any case, may oops if buffer_head is close to the end of slab
page and next page is not mapped and isn't what was intended there.
IOW, ->b_data is missing in that call.  Fortunately, result doesn't go
into the primary on-disk data structures, so only backup ones get crap
written to them; that had allowed this bug to remain unnoticed until
now.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-26 07:52:21 -07:00
David Woodhouse 52b5108ca7 [RBTREE] Update ext3 to use rb_parent() accessor macro.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-04-21 13:15:57 +01:00
Ananiev, Leonid I 75616cf985 [PATCH] ext3: Fix missed mutex unlock
Missed unlock_super()call is added in error condition code path.

Signed-off-by: Leonid Ananiev <leonid.i.ananiev@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-04-17 14:24:57 -07:00
Ananiev, Leonid I 389ed39b97 [PATCH] ext3: Fix missed mutex unlock
Missed unlock_super()call is added in error condition code path.

Signed-off-by: Leonid Ananiev <leonid.i.ananiev@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:46 -07:00
Jens Axboe 5274f052e7 [PATCH] Introduce sys_splice() system call
This adds support for the sys_splice system call. Using a pipe as a
transport, it can connect to files or sockets (latter as output only).

From the splice.c comments:

   "splice": joining two ropes together by interweaving their strands.

   This is the "extended pipe" functionality, where a pipe is used as
   an arbitrary in-memory buffer. Think of a pipe as a small kernel
   buffer that you can use to transfer data from one end to the other.

   The traditional unix read/write is extended with a "splice()" operation
   that transfers data buffers to or from a pipe buffer.

   Named by Larry McVoy, original implementation from Linus, extended by
   Jens to support splicing to files and fixing the initial implementation
   bugs.

Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-30 12:28:18 -08:00
Arjan van de Ven 4b6f5d20b0 [PATCH] Make most file operations structs in fs/ const
This is a conversion to make the various file_operations structs in fs/
const.  Basically a regexp job, with a few manual fixups

The goal is both to increase correctness (harder to accidentally write to
shared datastructures) and reducing the false sharing of cachelines with
things that get dirty in .data (while .rodata is nicely read only and thus
cache clean)

Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 09:16:06 -08:00
Badari Pulavarty a0e9285233 [PATCH] ext3: "nobh" writeback support for filesystems blocksize < pagesize
There is no valid reason why we can't support "nobh" option for filesystems
with blocksize != PAGESIZE.

This patch lets them use "nobh" option for writeback mode for blocksize <
pagesize.

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:02 -08:00
Badari Pulavarty f91a2ad2ed [PATCH] ext3: multi-block get_block()
Mingming Cao recently added multi-block allocation support for ext3,
currently used only by DIO.  I added support to map multiple blocks for
mpage_readpages().  This patch add support for ext3_get_block() to deal
with multi-block mapping.  Basically it renames ext3_direct_io_get_blocks()
as ext3_get_block().

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:02 -08:00
Andrew Morton d6859bfca8 [PATCH] ext3: cleanups and WARN_ON()
- Clean up a few little layout things and comments.

- Add a WARN_ON to a case which I was wondering about.

- Tune up some inlines.

Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:02 -08:00
Badari Pulavarty 1d8fa7a2b9 [PATCH] remove ->get_blocks() support
Now that get_block() can handle mapping multiple disk blocks, no need to have
->get_blocks().  This patch removes fs specific ->get_blocks() added for DIO
and makes it users use get_block() instead.

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:01 -08:00
Mingming Cao d48589bfad [PATCH] ext3_get_blocks: Adjust reservation window size for mblocks
Optimize the block reservation and the multiple block allocation: with the
knowledge of the total number of blocks ahead, set or adjust the reservation
window size properly (based on the number of blocks needed) before block
allocation happens: if there isn't any reservation yet, make sure the
reservation window equals to or greater than the number of blocks needed,
before create an reservation window; if a reservation window is already
exists, try to extends the window size to match the number of blocks to
allocate.  This could increase the possibility of completing multiple blocks
allocation in a single request, as blocks are only allocated in the range of
the inode's reservation window.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:01 -08:00
Mingming Cao faa569763a [PATCH] ext3_get_blocks: Adjust accounting info in ext3_new_blocks()
Update accounting information (quota, boundary checks, free blocks number etc)
in ext3_new_blocks().

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:01 -08:00
Mingming Cao b54e41ec17 [PATCH] ext3_get_blocks: support multiple blocks allocation in ext3_new_block()
Change ext3_try_to_allocate() (called via ext3_new_blocks()) to try to
allocate the requested number of blocks on a best effort basis: After
allocated the first block, it will always attempt to allocate the next few(up
to the requested size and not beyond the reservation window) adjacent blocks
at the same time.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:01 -08:00
Mingming Cao b47b24781c [PATCH] ext3_get_blocks: multiple block allocation
Add support for multiple block allocation in ext3-get-blocks().

Look up the disk block mapping and count the total number of blocks to
allocate, then pass it to ext3_new_block(), where the real block allocation is
performed.  Once multiple blocks are allocated, prepare the branch with those
just allocated blocks info and finally splice the whole branch into the block
mapping tree.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:01 -08:00
Mingming Cao 89747d369d [PATCH] ext3_get_blocks: Mapping multiple blocks at a once
Currently ext3_get_block() only maps or allocates one block at a time.  This
is quite inefficient for sequential IO workload.

I have posted a early implements a simply multiple block map and allocation
with current ext3.  The basic idea is allocating the 1st block in the existing
way, and attempting to allocate the next adjacent blocks on a best effort
basis.  More description about the implementation could be found here:
http://marc.theaimsgroup.com/?l=ext2-devel&m=112162230003522&w=2

The following the latest version of the patch: break the original patch into 5
patches, re-worked some logicals, and fixed some bugs.  The break ups are:

 [patch 1] Adding map multiple blocks at a time in ext3_get_blocks()
 [patch 2] Extend ext3_get_blocks() to support multiple block allocation
 [patch 3] Implement multiple block allocation in ext3-try-to-allocate
 (called via ext3_new_block()).
 [patch 4] Proper accounting updates in ext3_new_blocks()
 [patch 5] Adjust reservation window size properly (by the given number
 of blocks to allocate) before block allocation to increase the
 possibility of allocating multiple blocks in a single call.

Tests done so far includes fsx,tiobench and dbench.  The following numbers
collected from Direct IO tests (1G file creation/read) shows the system time
have been greatly reduced (more than 50% on my 8 cpu system) with the patches.

 1G file DIO write:
 	2.6.15		2.6.15+patches
 real    0m31.275s	0m31.161s
 user    0m0.000s	0m0.000s
 sys     0m3.384s	0m0.564s

 1G file DIO read:
 	2.6.15		2.6.15+patches
 real    0m30.733s	0m30.624s
 user    0m0.000s	0m0.004s
 sys     0m0.748s	0m0.380s

Some previous test we did on buffered IO with using multiple blocks allocation
and delayed allocation shows noticeable improvement on throughput and system
time.

This patch:

Add support of mapping multiple blocks in one call.

This is useful for DIO reads and re-writes (where blocks are already
allocated), also is in line with Christoph's proposal of using getblocks() in
mpage_readpage() or mpage_readpages().

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:00 -08:00
NeilBrown 2ff28e22bd [PATCH] Make address_space_operations->invalidatepage return void
The return value of this function is never used, so let's be honest and
declare it as void.

Some places where invalidatepage returned 0, I have inserted comments
suggesting a BUG_ON.

[akpm@osdl.org: JBD BUG fix]
[akpm@osdl.org: rework for git-nfs]
[akpm@osdl.org: don't go BUG in block_invalidate_page()]
Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:56:55 -08:00
Kirk True c1f5a19446 [PATCH] ext3: Fix debug logging-only compilation error
When EXT3FS_DEBUG is #define-d, the compile breaks due to #include file
issues.

Signed-off-by: Kirk True <kernel@kirkandsheila.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:56 -08:00
Glauber de Oliveira Costa b5a7c4f583 [PATCH] ext3: Properly report backup block present in a group
In filesystems with the meta block group flag on, ext3_bg_num_gdb() fails
to report the correct number of blocks used to store the group descriptor
backups in a given group.  It happens because meta_bg follows a different
logic from the original ext3 backup placement in groups multiples of 3, 5
and 7.

Signed-off-by: Glauber de Oliveira Costa <glommer@br.ibm.com>
Cc: Andreas Dilger <adilger@clusterfs.com>
Cc: "Stephen C. Tweedie" <sct@redhat.com>
Cc: Alex Tomas <alex@clusterfs.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:30 -08:00
Alex Tomas 09fe316a7b [PATCH] fast ext3_statfs
Under I/O load it may take up to a dozen seconds to read all group
descriptors.  This is what ext3_statfs() does.  At the same time, we already
maintain global numbers of free inodes/blocks.  Why don't we use them instead
of group reading and summing?

Cc: Ravikiran G Thirumalai <kiran@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:25 -08:00
Paul Jackson fffb60f93c [PATCH] cpuset memory spread: slab cache format
Rewrap the overly long source code lines resulting from the previous
patch's addition of the slab cache flag SLAB_MEM_SPREAD.  This patch
contains only formatting changes, and no function change.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:23 -08:00
Paul Jackson 4b6a9316fa [PATCH] cpuset memory spread: slab cache filesystems
Mark file system inode and similar slab caches subject to SLAB_MEM_SPREAD
memory spreading.

If a slab cache is marked SLAB_MEM_SPREAD, then anytime that a task that's
in a cpuset with the 'memory_spread_slab' option enabled goes to allocate
from such a slab cache, the allocations are spread evenly over all the
memory nodes (task->mems_allowed) allowed to that task, instead of favoring
allocation on the node local to the current cpu.

The following inode and similar caches are marked SLAB_MEM_SPREAD:

    file                               cache
    ====                               =====
    fs/adfs/super.c                    adfs_inode_cache
    fs/affs/super.c                    affs_inode_cache
    fs/befs/linuxvfs.c                 befs_inode_cache
    fs/bfs/inode.c                     bfs_inode_cache
    fs/block_dev.c                     bdev_cache
    fs/cifs/cifsfs.c                   cifs_inode_cache
    fs/coda/inode.c                    coda_inode_cache
    fs/dquot.c                         dquot
    fs/efs/super.c                     efs_inode_cache
    fs/ext2/super.c                    ext2_inode_cache
    fs/ext2/xattr.c (fs/mbcache.c)     ext2_xattr
    fs/ext3/super.c                    ext3_inode_cache
    fs/ext3/xattr.c (fs/mbcache.c)     ext3_xattr
    fs/fat/cache.c                     fat_cache
    fs/fat/inode.c                     fat_inode_cache
    fs/freevxfs/vxfs_super.c           vxfs_inode
    fs/hpfs/super.c                    hpfs_inode_cache
    fs/isofs/inode.c                   isofs_inode_cache
    fs/jffs/inode-v23.c                jffs_fm
    fs/jffs2/super.c                   jffs2_i
    fs/jfs/super.c                     jfs_ip
    fs/minix/inode.c                   minix_inode_cache
    fs/ncpfs/inode.c                   ncp_inode_cache
    fs/nfs/direct.c                    nfs_direct_cache
    fs/nfs/inode.c                     nfs_inode_cache
    fs/ntfs/super.c                    ntfs_big_inode_cache_name
    fs/ntfs/super.c                    ntfs_inode_cache
    fs/ocfs2/dlm/dlmfs.c               dlmfs_inode_cache
    fs/ocfs2/super.c                   ocfs2_inode_cache
    fs/proc/inode.c                    proc_inode_cache
    fs/qnx4/inode.c                    qnx4_inode_cache
    fs/reiserfs/super.c                reiser_inode_cache
    fs/romfs/inode.c                   romfs_inode_cache
    fs/smbfs/inode.c                   smb_inode_cache
    fs/sysv/inode.c                    sysv_inode_cache
    fs/udf/super.c                     udf_inode_cache
    fs/ufs/super.c                     ufs_inode_cache
    net/socket.c                       sock_inode_cache
    net/sunrpc/rpc_pipe.c              rpc_inode_cache

The choice of which slab caches to so mark was quite simple.  I marked
those already marked SLAB_RECLAIM_ACCOUNT, except for fs/xfs, dentry_cache,
inode_cache, and buffer_head, which were marked in a previous patch.  Even
though SLAB_RECLAIM_ACCOUNT is for a different purpose, it marks the same
potentially large file system i/o related slab caches as we need for memory
spreading.

Given that the rule now becomes "wherever you would have used a
SLAB_RECLAIM_ACCOUNT slab cache flag before (usually the inode cache), use
the SLAB_MEM_SPREAD flag too", this should be easy enough to maintain.
Future file system writers will just copy one of the existing file system
slab cache setups and tend to get it right without thinking.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:23 -08:00
Arjan van de Ven 9746151861 [PATCH] convert ext3's truncate_sem to a mutex
ext3's truncate_sem is always released in the same function it's taken
and it otherwise is a mutex as well..

Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:14 -08:00
Ingo Molnar d3be915fc5 [PATCH] sem2mutex: quota
Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Jan Kara <jack@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:11 -08:00
Andrew Morton d8733c2956 [PATCH] ext3_readdir: use generic readahead
Linus points out that ext3_readdir's readahead only cuts in when
ext3_readdir() is operating at the very start of the directory.  So for large
directories we end up performing no readahead at all and we suck.

So take it all out and use the core VM's page_cache_readahead().  This means
that ext3 directory reads will use all of readahead's dynamic sizing goop.

Note that we're using the directory's filp->f_ra to hold the readahead state,
but readahead is actually being performed against the underlying blockdev's
address_space.  Fortunately the readahead code is all set up to handle this.

Tested with printk.  It works.  I was struggling to find a real workload which
actually cared.

(The patch also exports page_cache_readahead() to GPL modules)

Cc: "Stephen C. Tweedie" <sct@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:09 -08:00
Badari Pulavarty cd6ef84e6a [PATCH] ext3: fix nobh mode for chattr +j inodes
One can do "chattr +j" on a file to change its journalling mode.  Fix
writeback mode with "nobh" handling for it.

Even though, we mount ext3 filesystem in writeback mode with "nobh" option,
some one can do "chattr +j" on a single file to force it to do journalled
mode.  In order to do journaling, ext3_block_truncate_page() need to
fallback to default case of creating buffers and adding them to transaction
etc.

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-11 09:19:34 -08:00