Commit Graph

750 Commits

Author SHA1 Message Date
Dan Magenheimer 9fdfdcf171 fs: add field to superblock to support cleancache
This second patch of eight in this cleancache series adds a field to
the generic superblock to squirrel away a pool identifier that is
dynamically provided by cleancache-enabled filesystems at mount time
to uniquely identify files and pages belonging to this mounted filesystem.

Details and a FAQ can be found in Documentation/vm/cleancache.txt

[v8: trivial merge conflict update]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Nitin Gupta <ngupta@vflare.org>
2011-05-26 10:01:19 -06:00
Tim Gardner 0ac1ee0bfe ulimit: raise default hard ulimit on number of files to 4096
Apps are increasingly using more than 1024 file descriptors.  See
discussion in several distro bug trackers, e.g.  BugLink:
http://bugs.launchpad.net/bugs/663090
https://issues.rpath.com/browse/RPL-2054

You don't want to raise the default soft limit, since that might break
apps that use select(), but it's safe to raise the default hard limit;
that way, apps that know they need lots of file descriptors can raise
their soft limit without needing root, and without user intervention.

Ubuntu is doing this with a kernel change because they have a policy of
not changing kernel defaults in userland.

While 4096 might not be enough for *all* apps, it seems to be plenty for
the apps I've seen lately that are unhappy with 1024.

Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Cc: Dan Kegel <dank@kegel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 08:39:43 -07:00
Peter Zijlstra 3d48ae45e7 mm: Convert i_mmap_lock to a mutex
Straightforward conversion of i_mmap_lock to a mutex.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 08:39:18 -07:00
Peter Zijlstra 97a894136f mm: Remove i_mmap_lock lockbreak
Hugh says:
 "The only significant loser, I think, would be page reclaim (when
  concurrent with truncation): could spin for a long time waiting for
  the i_mmap_mutex it expects would soon be dropped? "

Counter points:
 - cpu contention makes the spin stop (need_resched())
 - zap pages should be freeing pages at a higher rate than reclaim
   ever can

I think the simplification of the truncate code is definitely worth it.

Effectively reverts: 2aa15890f3 ("mm: prevent concurrent
unmap_mapping_range() on the same inode") and takes out the code that
caused its problem.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 08:39:17 -07:00
Linus Torvalds eed631e0d7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: fix FS_IOC_SETFLAGS ioctl
  Btrfs: fix FS_IOC_GETFLAGS ioctl
  fs: remove FS_COW_FL
  Btrfs: fix easily get into ENOSPC in mixed case
  Prevent oopsing in posix_acl_valid()
2011-05-15 10:22:10 -07:00
Li Zefan e1e8fb6a1f fs: remove FS_COW_FL
FS_COW_FL and FS_NOCOW_FL were newly introduced to control per file
COW in btrfs, but FS_NOCOW_FL is sufficient.

The fact is we don't have corresponding BTRFS_INODE_COW flag.

COW is default, and FS_NOCOW_FL can be used to switch off COW for
a single file.

If we mount btrfs with nodatacow, a newly created file will be set with
the FS_NOCOW_FL flag. So to turn on COW for it, we can just clear the
FS_NOCOW_FL flag.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-14 16:10:26 -04:00
Linus Torvalds 0bba01695b vfs: Re-introduce s_uuid in the superblock
Gaah.  When commit be85bccaa5 reverted the export of file system uuid
via /proc/<pid>/mountinfo, it also unintentionally removed the s_uuid
field in struct super_block.

I didn't mean to do that, since filesystems have been taught to fill it
in (and we want to keep it for future re-introduction in the mountinfo
file).

Stupid of me. This adds it back in.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-12 15:21:04 -07:00
Linus Torvalds be85bccaa5 Revert "vfs: Export file system uuid via /proc/<pid>/mountinfo"
This reverts commit 93f1c20bc8.

It turns out that libmount misparses it because it adds a '-' character
in the uuid string, which libmount then incorrectly confuses with the
separator string (" - ") at the end of all the optional arguments.

Upstream libmount (in the util-linux tree) has been fixed, but until
that fix actually percolates up to users, we'd better not expose this
change in the kernel.

Let's revisit this later (possibly by exposing the UUID without any '-'
characters in it, avoiding the user-space bug).

Reported-by: Dave Jones <davej@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Karel Zak <kzak@redhat.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-12 13:35:56 -07:00
Linus Torvalds 42933bac11 Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6
* 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6:
  Fix common misspellings
2011-04-07 11:14:49 -07:00
Jens Axboe 7dcda1c96d fs: export empty_aops
With the ->sync_page() hook gone, we have a few users that
add their own static address_space_operations without any
functions defined.

fs/inode.c already has an empty_aops that it uses for init
purposes. Lets export that and use it in the places where
an otherwise empty aops was defined.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-05 23:51:48 +02:00
Lucas De Marchi 25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
Linus Torvalds 212a17ab87 Merge branch 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (45 commits)
  Btrfs: fix __btrfs_map_block on 32 bit machines
  btrfs: fix possible deadlock by clearing __GFP_FS flag
  btrfs: check link counter overflow in link(2)
  btrfs: don't mess with i_nlink of unlocked inode in rename()
  Btrfs: check return value of btrfs_alloc_path()
  Btrfs: fix OOPS of empty filesystem after balance
  Btrfs: fix memory leak of empty filesystem after balance
  Btrfs: fix return value of setflags ioctl
  Btrfs: fix uncheck memory allocations
  btrfs: make inode ref log recovery faster
  Btrfs: add btrfs_trim_fs() to handle FITRIM
  Btrfs: adjust btrfs_discard_extent() return errors and trimmed bytes
  Btrfs: make btrfs_map_block() return entire free extent for each device of RAID0/1/10/DUP
  Btrfs: make update_reserved_bytes() public
  btrfs: return EXDEV when linking from different subvolumes
  Btrfs: Per file/directory controls for COW and compression
  Btrfs: add datacow flag in inode flag
  btrfs: use GFP_NOFS instead of GFP_KERNEL
  Btrfs: check return value of read_tree_block()
  btrfs: properly access unaligned checksum buffer
  ...

Fix up trivial conflicts in fs/btrfs/volumes.c due to plug removal in
the block layer.
2011-03-28 15:31:05 -07:00
liubo 32471f6e19 Btrfs: add datacow flag in inode flag
For datacow control, the corresponding inode flags are needed.
This is for btrfs use.

v1->v2:
Change FS_COW_FL to another bit due to conflict with the upstream e2fsprogs

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:40 -04:00
Linus Torvalds d39dd11c3e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  fs: simplify iget & friends
  fs: pull inode->i_lock up out of writeback_single_inode
  fs: rename inode_lock to inode_hash_lock
  fs: move i_wb_list out from under inode_lock
  fs: move i_sb_list out from under inode_lock
  fs: remove inode_lock from iput_final and prune_icache
  fs: Lock the inode LRU list separately
  fs: factor inode disposal
  fs: protect inode->i_state with inode->i_lock
  autofs4: Do not potentially dereference NULL pointer returned by fget() in autofs_dev_ioctl_setpipefd()
  autofs4 - remove autofs4_lock
  autofs4 - fix d_manage() return on rcu-walk
  autofs4 - fix autofs4_expire_indirect() traversal
  autofs4 - fix dentry leak in autofs4_expire_direct()
  autofs4 - reinstate last used update on access
  vfs - check non-mountpoint dentry might block in __follow_mount_rcu()
2011-03-24 19:01:30 -07:00
Dave Chinner 250df6ed27 fs: protect inode->i_state with inode->i_lock
Protect inode state transitions and validity checks with the
inode->i_lock. This enables us to make inode state transitions
independently of the inode_lock and is the first step to peeling
away the inode_lock from the code.

This requires that __iget() is done atomically with i_state checks
during list traversals so that we don't race with another thread
marking the inode I_FREEING between the state check and grabbing the
reference.

Also remove the unlock_new_inode() memory barrier optimisation
required to avoid taking the inode_lock when clearing I_NEW.
Simplify the code by simply taking the inode->i_lock around the
state change and wakeup. Because the wakeup is no longer tricky,
remove the wake_up_inode() function and open code the wakeup where
necessary.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-24 21:16:31 -04:00
Linus Torvalds 6c51038900 Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
  Documentation/iostats.txt: bit-size reference etc.
  cfq-iosched: removing unnecessary think time checking
  cfq-iosched: Don't clear queue stats when preempt.
  blk-throttle: Reset group slice when limits are changed
  blk-cgroup: Only give unaccounted_time under debug
  cfq-iosched: Don't set active queue in preempt
  block: fix non-atomic access to genhd inflight structures
  block: attempt to merge with existing requests on plug flush
  block: NULL dereference on error path in __blkdev_get()
  cfq-iosched: Don't update group weights when on service tree
  fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
  block: Require subsystems to explicitly allocate bio_set integrity mempool
  jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
  jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
  fs: make fsync_buffers_list() plug
  mm: make generic_writepages() use plugging
  blk-cgroup: Add unaccounted time to timeslice_used.
  block: fixup plugging stubs for !CONFIG_BLOCK
  block: remove obsolete comments for blkdev_issue_zeroout.
  blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
  ...

Fix up conflicts in fs/{aio.c,super.c}
2011-03-24 10:16:26 -07:00
Serge E. Hallyn 2e14967075 userns: rename is_owner_or_cap to inode_owner_or_capable
And give it a kernel-doc comment.

[akpm@linux-foundation.org: btrfs changed in linux-next]
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:47:13 -07:00
Serge E. Hallyn e795b71799 userns: userns: check user namespace for task->file uid equivalence checks
Cheat for now and say all files belong to init_user_ns.  Next step will be
to let superblocks belong to a user_ns, and derive inode_userns(inode)
from inode->i_sb->s_user_ns.  Finally we'll introduce more flexible
arrangements.

Changelog:
	Feb 15: make is_owner_or_cap take const struct inode
	Feb 23: make is_owner_or_cap bool

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:47:08 -07:00
Richard Kennedy f3ccfcdaf3 fs.h: remove 8 bytes of padding from block_device on 64bit builds
Re-ordering struct block_inode to remove 8 bytes of padding on 64 bit
builds, which also shrinks bdev_inode by 8 bytes (776 -> 768) allowing it
to fit into one fewer cache lines.

Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:10 -07:00
Al Viro 474a00ee13 kill simple_set_mnt()
not needed anymore, since all users (->get_sb() instances) are gone.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-17 21:31:32 -04:00
Linus Torvalds 054cfaacf8 Merge branch 'mnt_devname' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'mnt_devname' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  vfs: bury ->get_sb()
  nfs: switch NFS from ->get_sb() to ->mount()
  nfs: stop mangling ->mnt_devname on NFS
  vfs: new superblock methods to override /proc/*/mount{s,info}
  nfs: nfs_do_{ref,sub}mount() superblock argument is redundant
  nfs: make nfs_path() work without vfsmount
  nfs: store devname at disconnected NFS roots
  nfs: propagate devname to nfs{,4}_get_root()
2011-03-16 19:09:57 -07:00
Al Viro 1a102ff925 vfs: bury ->get_sb()
This is an ex-parrot.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-16 16:48:06 -04:00
Al Viro c7f404b40a vfs: new superblock methods to override /proc/*/mount{s,info}
a) ->show_devname(m, mnt) - what to put into devname columns in mounts,
mountinfo and mountstats
b) ->show_path(m, mnt) - what to put into relative path column in mountinfo

Leaving those NULL gives old behaviour.  NFS switched to using those.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-16 16:48:06 -04:00
Linus Torvalds 0f6e0e8448 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (33 commits)
  AppArmor: kill unused macros in lsm.c
  AppArmor: cleanup generated files correctly
  KEYS: Add an iovec version of KEYCTL_INSTANTIATE
  KEYS: Add a new keyctl op to reject a key with a specified error code
  KEYS: Add a key type op to permit the key description to be vetted
  KEYS: Add an RCU payload dereference macro
  AppArmor: Cleanup make file to remove cruft and make it easier to read
  SELinux: implement the new sb_remount LSM hook
  LSM: Pass -o remount options to the LSM
  SELinux: Compute SID for the newly created socket
  SELinux: Socket retains creator role and MLS attribute
  SELinux: Auto-generate security_is_socket_class
  TOMOYO: Fix memory leak upon file open.
  Revert "selinux: simplify ioctl checking"
  selinux: drop unused packet flow permissions
  selinux: Fix packet forwarding checks on postrouting
  selinux: Fix wrong checks for selinux_policycap_netpeer
  selinux: Fix check for xfrm selinux context algorithm
  ima: remove unnecessary call to ima_must_measure
  IMA: remove IMA imbalance checking
  ...
2011-03-16 09:15:43 -07:00
Al Viro 1abf0c718f New kind of open files - "location only".
New flag for open(2) - O_PATH.  Semantics:
	* pathname is resolved, but the file itself is _NOT_ opened
as far as filesystem is concerned.
	* almost all operations on the resulting descriptors shall
fail with -EBADF.  Exceptions are:
	1) operations on descriptors themselves (i.e.
		close(), dup(), dup2(), dup3(), fcntl(fd, F_DUPFD),
		fcntl(fd, F_DUPFD_CLOEXEC, ...), fcntl(fd, F_GETFD),
		fcntl(fd, F_SETFD, ...))
	2) fcntl(fd, F_GETFL), for a common non-destructive way to
		check if descriptor is open
	3) "dfd" arguments of ...at(2) syscalls, i.e. the starting
		points of pathname resolution
	* closing such descriptor does *NOT* affect dnotify or
posix locks.
	* permissions are checked as usual along the way to file;
no permission checks are applied to the file itself.  Of course,
giving such thing to syscall will result in permission checks (at
the moment it means checking that starting point of ....at() is
a directory and caller has exec permissions on it).

fget() and fget_light() return NULL on such descriptors; use of
fget_raw() and fget_raw_light() is needed to get them.  That protects
existing code from dealing with those things.

There are two things still missing (they come in the next commits):
one is handling of symlinks (right now we refuse to open them that
way; see the next commit for semantics related to those) and another
is descriptor passing via SCM_RIGHTS datagrams.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-15 02:21:45 -04:00
Aneesh Kumar K.V 93f1c20bc8 vfs: Export file system uuid via /proc/<pid>/mountinfo
We add a per superblock uuid field. File systems should
update the uuid in the fill_super callback

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-15 02:21:45 -04:00
Aneesh Kumar K.V 990d6c2d7a vfs: Add name to file handle conversion support
The syscall also return mount id which can be used
to lookup file system specific information such as uuid
in /proc/<pid>/mountinfo

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-15 02:21:37 -04:00
Al Viro c8b91accfa clean statfs-like syscalls up
New helpers: user_statfs() and fd_statfs(), taking userland pathname and
descriptor resp. and filling struct kstatfs.  Syscalls of statfs family
(native, compat and foreign - osf and hpux on alpha and parisc resp.)
switched to those.  Removes some boilerplate code, simplifies cleanup
on errors...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14 09:15:28 -04:00
Al Viro 73d049a40f open-style analog of vfs_path_lookup()
new function: file_open_root(dentry, mnt, name, flags) opens the file
vfs_path_lookup would arrive to.

Note that name can be empty; in that case the usual requirement that
dentry should be a directory is lifted.

open-coded equivalents switched to it, may_open() got down exactly
one caller and became static.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14 09:15:28 -04:00
Al Viro 47c805dc2d switch do_filp_open() to struct open_flags
take calculation of open_flags by open(2) arguments into new helper
in fs/open.c, move filp_open() over there, have it and do_sys_open()
use that helper, switch exec.c callers of do_filp_open() to explicit
(and constant) struct open_flags.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14 09:15:25 -04:00
Jens Axboe 4c63f5646e Merge branch 'for-2.6.39/stack-plug' into for-2.6.39/core
Conflicts:
	block/blk-core.c
	block/blk-flush.c
	drivers/md/raid1.c
	drivers/md/raid10.c
	drivers/md/raid5.c
	fs/nilfs2/btnode.c
	fs/nilfs2/mdt.c

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-10 08:58:35 +01:00
Jens Axboe 721a9602e6 block: kill off REQ_UNPLUG
With the plugging now being explicitly controlled by the
submitter, callers need not pass down unplugging hints
to the block layer. If they want to unplug, it's because they
manually plugged on their own - in which case, they should just
unplug at will.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-10 08:52:27 +01:00
Jens Axboe 7eaceaccab block: remove per-queue plugging
Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-10 08:52:07 +01:00
James Morris 1cc26bada9 Merge branch 'master'; commit 'v2.6.38-rc7' into next 2011-03-08 10:55:06 +11:00
Linus Torvalds 638691a7a4 Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md:
  md: Fix - again - partition detection when array becomes active
  Fix over-zealous flush_disk when changing device size.
  md: avoid spinlock problem in blk_throtl_exit
  md: correctly handle probe of an 'mdp' device.
  md: don't set_capacity before array is active.
  md: Fix raid1->raid0 takeover
2011-02-25 11:13:26 -08:00
NeilBrown 93b270f76e Fix over-zealous flush_disk when changing device size.
There are two cases when we call flush_disk.
In one, the device has disappeared (check_disk_change) so any
data will hold becomes irrelevant.
In the oter, the device has changed size (check_disk_size_change)
so data we hold may be irrelevant.

In both cases it makes sense to discard any 'clean' buffers,
so they will be read back from the device if needed.

In the former case it makes sense to discard 'dirty' buffers
as there will never be anywhere safe to write the data.  In the
second case it *does*not* make sense to discard dirty buffers
as that will lead to file system corruption when you simply enlarge
the containing devices.

flush_disk calls __invalidate_devices.
__invalidate_device calls both invalidate_inodes and invalidate_bdev.

invalidate_inodes *does* discard I_DIRTY inodes and this does lead
to fs corruption.

invalidate_bev *does*not* discard dirty pages, but I don't really care
about that at present.

So this patch adds a flag to __invalidate_device (calling it
__invalidate_device2) to indicate whether dirty buffers should be
killed, and this is passed to invalidate_inodes which can choose to
skip dirty inodes.

flusk_disk then passes true from check_disk_change and false from
check_disk_size_change.

dm avoids tripping over this problem by calling i_size_write directly
rathher than using check_disk_size_change.

md does use check_disk_size_change and so is affected.

This regression was introduced by commit 608aeef17a which causes
check_disk_size_change to call flush_disk, so it is suitable for any
kernel since 2.6.27.

Cc: stable@kernel.org
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Andrew Patterson <andrew.patterson@hp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-02-24 17:25:47 +11:00
Miklos Szeredi 2aa15890f3 mm: prevent concurrent unmap_mapping_range() on the same inode
Michael Leun reported that running parallel opens on a fuse filesystem
can trigger a "kernel BUG at mm/truncate.c:475"

Gurudas Pai reported the same bug on NFS.

The reason is, unmap_mapping_range() is not prepared for more than
one concurrent invocation per inode.  For example:

  thread1: going through a big range, stops in the middle of a vma and
     stores the restart address in vm_truncate_count.

  thread2: comes in with a small (e.g. single page) unmap request on
     the same vma, somewhere before restart_address, finds that the
     vma was already unmapped up to the restart address and happily
     returns without doing anything.

Another scenario would be two big unmap requests, both having to
restart the unmapping and each one setting vm_truncate_count to its
own value.  This could go on forever without any of them being able to
finish.

Truncate and hole punching already serialize with i_mutex.  Other
callers of unmap_mapping_range() do not, and it's difficult to get
i_mutex protection for all callers.  In particular ->d_revalidate(),
which calls invalidate_inode_pages2_range() in fuse, may be called
with or without i_mutex.

This patch adds a new mutex to 'struct address_space' to prevent
running multiple concurrent unmap_mapping_range() on the same mapping.

[ We'll hopefully get rid of all this with the upcoming mm
  preemptibility series by Peter Zijlstra, the "mm: Remove i_mmap_mutex
  lockbreak" patch in particular.  But that is for 2.6.39 ]

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reported-by: Michael Leun <lkml20101129@newton.leun.net>
Reported-by: Gurudas Pai <gurudas.pai@oracle.com>
Tested-by: Gurudas Pai <gurudas.pai@oracle.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-23 19:52:52 -08:00
Mimi Zohar a5c96ebf1d IMA: define readcount functions
Define i_readcount_inc/dec() functions to be called from the VFS layer.

Changelog:
- renamed iget/iput_readcount to i_readcount_inc/dec (Dave Chinner's suggestion)
- removed i_lock in iput_readcount() (based on comments:Dave Chinner,Eric Paris)

Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Acked-by: Eric Paris <eparis@redhat.com>
2011-02-10 07:51:43 -05:00
Mimi Zohar a68a27b6f2 IMA: convert i_readcount to atomic
Convert the inode's i_readcount from an unsigned int to atomic.

Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Acked-by: Eric Paris <eparis@redhat.com>
2011-02-10 07:51:43 -05:00
Namhyung Kim 3cd90ea42f vfs: sparse: add __FMODE_EXEC
FMODE_EXEC is a constant type of fmode_t but was used with normal integer
constants.  This results in following warnings from sparse.  Fix it using
new macro __FMODE_EXEC.

 fs/exec.c:116:58: warning: restricted fmode_t degrades to integer
 fs/exec.c:689:58: warning: restricted fmode_t degrades to integer
 fs/fcntl.c:777:9: warning: restricted fmode_t degrades to integer

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-02 16:03:19 -08:00
Namhyung Kim 1a44bc8c7c vfs: sparse: remove a warning on OPEN_FMODE()
AND-ing FMODE_* constant with normal integer results in following
sparse warnings. Fix it.

 fs/open.c:662:21: warning: restricted fmode_t degrades to integer
 fs/anon_inodes.c:123:34: warning: restricted fmode_t degrades to integer

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-02 16:03:19 -08:00
Namhyung Kim ecf5632dd1 fs: fix address space warnings in ioctl_fiemap()
The fi_extents_start field of struct fiemap_extent_info is a
user pointer but was not marked as __user. This makes sparse
emit following warnings:

  CHECK   fs/ioctl.c
fs/ioctl.c:114:26: warning: incorrect type in argument 1 (different address spaces)
fs/ioctl.c:114:26:    expected void [noderef] <asn:1>*dst
fs/ioctl.c:114:26:    got struct fiemap_extent *[assigned] dest
fs/ioctl.c:202:14: warning: incorrect type in argument 1 (different address spaces)
fs/ioctl.c:202:14:    expected void const volatile [noderef] <asn:1>*<noident>
fs/ioctl.c:202:14:    got struct fiemap_extent *[assigned] fi_extents_start
fs/ioctl.c:212:27: warning: incorrect type in argument 1 (different address spaces)
fs/ioctl.c:212:27:    expected void [noderef] <asn:1>*dst
fs/ioctl.c:212:27:    got char *<noident>

Also add 'ufiemap' variable to eliminate unnecessary casts.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 08:21:42 -05:00
Christoph Hellwig 2fe17c1075 fallocate should be a file operation
Currently all filesystems except XFS implement fallocate asynchronously,
while XFS forced a commit.  Both of these are suboptimal - in case of O_SYNC
I/O we really want our allocation on disk, especially for the !KEEP_SIZE
case where we actually grow the file with user-visible zeroes.  On the
other hand always commiting the transaction is a bad idea for fast-path
uses of fallocate like for example in recent Samba versions.   Given
that block allocation is a data plane operation anyway change it from
an inode operation to a file operation so that we have the file structure
available that lets us check for O_SYNC.

This also includes moving the code around for a few of the filesystems,
and remove the already unnedded S_ISDIR checks given that we only wire
up fallocate for regular files.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 02:25:31 -05:00
Linus Torvalds f8206b925f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (23 commits)
  sanitize vfsmount refcounting changes
  fix old umount_tree() breakage
  autofs4: Merge the remaining dentry ops tables
  Unexport do_add_mount() and add in follow_automount(), not ->d_automount()
  Allow d_manage() to be used in RCU-walk mode
  Remove a further kludge from __do_follow_link()
  autofs4: Bump version
  autofs4: Add v4 pseudo direct mount support
  autofs4: Fix wait validation
  autofs4: Clean up autofs4_free_ino()
  autofs4: Clean up dentry operations
  autofs4: Clean up inode operations
  autofs4: Remove unused code
  autofs4: Add d_manage() dentry operation
  autofs4: Add d_automount() dentry operation
  Remove the automount through follow_link() kludge code from pathwalk
  CIFS: Use d_automount() rather than abusing follow_link()
  NFS: Use d_automount() rather than abusing follow_link()
  AFS: Use d_automount() rather than abusing follow_link()
  Add an AT_NO_AUTOMOUNT flag to suppress terminal automount
  ...
2011-01-16 11:31:50 -08:00
David Howells 9875cf8064 Add a dentry op to handle automounting rather than abusing follow_link()
Add a dentry op (d_automount) to handle automounting directories rather than
abusing the follow_link() inode operation.  The operation is keyed off a new
dentry flag (DCACHE_NEED_AUTOMOUNT).

This also makes it easier to add an AT_ flag to suppress terminal segment
automount during pathwalk and removes the need for the kludge code in the
pathwalk algorithm to handle directories with follow_link() semantics.

The ->d_automount() dentry operation:

	struct vfsmount *(*d_automount)(struct path *mountpoint);

takes a pointer to the directory to be mounted upon, which is expected to
provide sufficient data to determine what should be mounted.  If successful, it
should return the vfsmount struct it creates (which it should also have added
to the namespace using do_add_mount() or similar).  If there's a collision with
another automount attempt, NULL should be returned.  If the directory specified
by the parameter should be used directly rather than being mounted upon,
-EISDIR should be returned.  In any other case, an error code should be
returned.

The ->d_automount() operation is called with no locks held and may sleep.  At
this point the pathwalk algorithm will be in ref-walk mode.

Within fs/namei.c itself, a new pathwalk subroutine (follow_automount()) is
added to handle mountpoints.  It will return -EREMOTE if the automount flag was
set, but no d_automount() op was supplied, -ELOOP if we've encountered too many
symlinks or mountpoints, -EISDIR if the walk point should be used without
mounting and 0 if successful.  The path will be updated to point to the mounted
filesystem if a successful automount took place.

__follow_mount() is replaced by follow_managed() which is more generic
(especially with the patch that adds ->d_manage()).  This handles transits from
directories during pathwalk, including automounting and skipping over
mountpoints (and holding processes with the next patch).

__follow_mount_rcu() will jump out of RCU-walk mode if it encounters an
automount point with nothing mounted on it.

follow_dotdot*() does not handle automounts as you don't want to trigger them
whilst following "..".

I've also extracted the mount/don't-mount logic from autofs4 and included it
here.  It makes the mount go ahead anyway if someone calls open() or creat(),
tries to traverse the directory, tries to chdir/chroot/etc. into the directory,
or sticks a '/' on the end of the pathname.  If they do a stat(), however,
they'll only trigger the automount if they didn't also say O_NOFOLLOW.

I've also added an inode flag (S_AUTOMOUNT) so that filesystems can mark their
inodes as automount points.  This flag is automatically propagated to the
dentry as DCACHE_NEED_AUTOMOUNT by __d_instantiate().  This saves NFS and could
save AFS a private flag bit apiece, but is not strictly necessary.  It would be
preferable to do the propagation in d_set_d_op(), but that doesn't normally
have access to the inode.

[AV: fixed breakage in case if __follow_mount_rcu() fails and nameidata_drop_rcu()
succeeds in RCU case of do_lookup(); we need to fall through to non-RCU case after
that, rather than just returning with ungrabbed *path]

Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:05:03 -05:00
Linus Torvalds 6ab8219649 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  block: restore multiple bd_link_disk_holder() support
  block cfq: compensate preempted queue even if it has no slice assigned
  block cfq: make queue preempt work for queues from different workload
2011-01-14 13:32:07 -08:00
Linus Torvalds 18bce371ae Merge branch 'for-2.6.38' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.38' of git://linux-nfs.org/~bfields/linux: (62 commits)
  nfsd4: fix callback restarting
  nfsd: break lease on unlink, link, and rename
  nfsd4: break lease on nfsd setattr
  nfsd: don't support msnfs export option
  nfsd4: initialize cb_per_client
  nfsd4: allow restarting callbacks
  nfsd4: simplify nfsd4_cb_prepare
  nfsd4: give out delegations more quickly in 4.1 case
  nfsd4: add helper function to run callbacks
  nfsd4: make sure sequence flags are set after destroy_session
  nfsd4: re-probe callback on connection loss
  nfsd4: set sequence flag when backchannel is down
  nfsd4: keep finer-grained callback status
  rpc: allow xprt_class->setup to return a preexisting xprt
  rpc: keep backchannel xprt as long as server connection
  rpc: move sk_bc_xprt to svc_xprt
  nfsd4: allow backchannel recovery
  nfsd4: support BIND_CONN_TO_SESSION
  nfsd4: modify session list under cl_lock
  Documentation: fl_mylease no longer exists
  ...

Fix up conflicts in fs/nfsd/vfs.c with the vfs-scale work.  The
vfs-scale work touched some msnfs cases, and this merge removes support
for that entirely, so the conflict was trivial to resolve.
2011-01-14 13:17:26 -08:00
Tejun Heo 49731baa41 block: restore multiple bd_link_disk_holder() support
Commit e09b457b (block: simplify holder symlink handling) incorrectly
assumed that there is only one link at maximum.  dm may use multiple
links and expects block layer to track reference count for each link,
which is different from and unrelated to the exclusive device holder
identified by @holder when the device is opened.

Remove the single holder assumption and automatic removal of the link
and revive the per-link reference count tracking.  The code
essentially behaves the same as before commit e09b457b sans the
unnecessary kobject reference count dancing.

While at it, note that this facility should not be used by anyone else
than the current ones.  Sysfs symlinks shouldn't be abused like this
and the whole thing doesn't belong in the block layer at all.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Milan Broz <mbroz@redhat.com>
Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-01-14 18:44:22 +01:00
Linus Torvalds 275220f0fc Merge branch 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
  block: ensure that completion error gets properly traced
  blktrace: add missing probe argument to block_bio_complete
  block cfq: don't use atomic_t for cfq_group
  block cfq: don't use atomic_t for cfq_queue
  block: trace event block fix unassigned field
  block: add internal hd part table references
  block: fix accounting bug on cross partition merges
  kref: add kref_test_and_get
  bio-integrity: mark kintegrityd_wq highpri and CPU intensive
  block: make kblockd_workqueue smarter
  Revert "sd: implement sd_check_events()"
  block: Clean up exit_io_context() source code.
  Fix compile warnings due to missing removal of a 'ret' variable
  fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
  block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
  cfq-iosched: don't check cfqg in choose_service_tree()
  fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
  cdrom: export cdrom_check_events()
  sd: implement sd_check_events()
  sr: implement sr_check_events()
  ...
2011-01-13 10:45:01 -08:00
Al Viro c74a1cbb3c pass default dentry_operations to mount_pseudo()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:03:43 -05:00
Al Viro c8aebb0c9f per-superblock default ->d_op
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:02:34 -05:00
Alexey Dobriyan 57cc7215b7 headers: kobject.h redux
Remove kobject.h from files which don't need it, notably,
sched.h and fs.h.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-10 08:51:44 -08:00
Nick Piggin ceb5bdc2d2 fs: dcache per-bucket dcache hash locking
We can turn the dcache hash locking from a global dcache_hash_lock into
per-bucket locking.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:31 +11:00
Nick Piggin b74c79e993 fs: provide rcu-walk aware permission i_ops
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:29 +11:00
Nick Piggin 44a7d7a878 fs: cache optimise dentry and inode for rcu-walk
Put dentry and inode fields into top of data structure.  This allows RCU path
traversal to perform an RCU dentry lookup in a path walk by touching only the
first 56 bytes of the dentry.

We also fit in 8 bytes of inline name in the first 64 bytes, so for short
names, only 64 bytes needs to be touched to perform the lookup. We should
get rid of the hash->prev pointer from the first 64 bytes, and fit 16 bytes
of name in there, which will take care of 81% rather than 32% of the kernel
tree.

inode is also rearranged so that RCU lookup will only touch a single cacheline
in the inode, plus one in the i_ops structure.

This is important for directory component lookups in RCU path walking. In the
kernel source, directory names average is around 6 chars, so this works.

When we reach the last element of the lookup, we need to lock it and take its
refcount which requires another cacheline access.

Align dentry and inode operations structs, so members will be at predictable
offsets and we can group common operations into head of structure.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:28 +11:00
Nick Piggin ff0c7d15f9 fs: avoid inode RCU freeing for pseudo fs
Pseudo filesystems that don't put inode on RCU list or reachable by
rcu-walk dentries do not need to RCU free their inodes.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:26 +11:00
Nick Piggin fa0d7e3de6 fs: icache RCU free inodes
RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for
  permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
  to take i_lock no longer need to take sb_inode_list_lock to walk the list in
  the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
  page lock to follow page->mapping.

The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.

In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.

The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:26 +11:00
Nick Piggin b5c84bf6f6 fs: dcache remove dcache_lock
dcache_lock no longer protects anything. remove it.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:23 +11:00
J. Bruce Fields c45821d263 locks: eliminate fl_mylease callback
The nfs server only supports read delegations for now, so we don't care
how conflicts are determined.  All we care is that unlocks are
recognized as matching the leases they are meant to remove.  After the
last patch, a comparison of struct files will work for that purpose.  So
we no longer need this callback.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-04 16:49:28 -05:00
Tejun Heo 77ea887e43 implement in-kernel gendisk events handling
Currently, media presence polling for removeable block devices is done
from userland.  There are several issues with this.

* Polling is done by periodically opening the device.  For SCSI
  devices, the command sequence generated by such action involves a
  few different commands including TEST_UNIT_READY.  This behavior,
  while perfectly legal, is different from Windows which only issues
  single command, GET_EVENT_STATUS_NOTIFICATION.  Unfortunately, some
  ATAPI devices lock up after being periodically queried such command
  sequences.

* There is no reliable and unintrusive way for a userland program to
  tell whether the target device is safe for media presence polling.
  For example, polling for media presence during an on-going burning
  session can make it fail.  The polling program can avoid this by
  opening the device with O_EXCL but then it risks making a valid
  exclusive user of the device fail w/ -EBUSY.

* Userland polling is unnecessarily heavy and in-kernel implementation
  is lighter and better coordinated (workqueue, timer slack).

This patch implements framework for in-kernel disk event handling,
which includes media presence polling.

* bdops->check_events() is added, which supercedes ->media_changed().
  It should check whether there's any pending event and return if so.
  Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
  DISK_EVENT_EJECT_REQUEST.  ->check_events() is guaranteed not to be
  called parallelly.

* gendisk->events and ->async_events are added.  These should be
  initialized by block driver before passing the device to add_disk().
  The former contains the mask of all supported events and the latter
  the mask of all events which the device can report without polling.
  /sys/block/*/events[_async] export these to userland.

* Kernel parameter block.events_dfl_poll_msecs controls the system
  polling interval (default is 0 which means disable) and
  /sys/block/*/events_poll_msecs control polling intervals for
  individual devices (default is -1 meaning use system setting).  Note
  that if a device can report all supported events asynchronously and
  its polling interval isn't explicitly set, the device won't be
  polled regardless of the system polling interval.

* If a device is opened exclusively with write access, event checking
  is automatically disabled until all write exclusive accesses are
  released.

* There are event 'clearing' events.  For example, both of currently
  defined events are cleared after the device has been successfully
  opened.  This information is passed to ->check_events() callback
  using @clearing argument as a hint.

* Event checking is always performed from system_nrt_wq and timer
  slack is set to 25% for polling.

* Nothing changes for drivers which implement ->media_changed() but
  not ->check_events().  Going forward, all drivers will be converted
  to ->check_events() and ->media_change() will be dropped.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-16 17:53:38 +01:00
Linus Torvalds 6072d13c42 Call the filesystem back whenever a page is removed from the page cache
NFS needs to be able to release objects that are stored in the page
cache once the page itself is no longer visible from the page cache.

This patch adds a callback to the address space operations that allows
filesystems to perform page cleanups once the page has been removed
from the page cache.

Original patch by: Linus Torvalds <torvalds@linux-foundation.org>
[trondmy: cover the cases of invalidate_inode_pages2() and
          truncate_inode_pages()]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-02 09:55:21 -05:00
Loïc Minier 3a3a1af37f include/linux/fs.h: fix userspace build
dpkg uses fiemap but didn't particularly need to include stdint.h so far.
Since 367a51a339 ("fs: Add FITRIM ioctl"), build of linux/fs.h failed in
dpkg with:

  In file included from ../../src/filesdb.c:27:0:
  /usr/include/linux/fs.h:37:2: error: expected specifier-qualifier-list before 'uint64_t'

Use exportable type __u64 to avoid the dependency on stdint.h.

b31d42a5af ("Fix compile brekage with !CONFIG_BLOCK") fixed only the
kernel build by including linux/types.h, but this also fixed "make
headers_check", so don't revert it.

Signed-off-by: Loïc Minier <loic.minier@linaro.org>
Tested-by: Arnd Bergmann <arnd.bergmann@linaro.org>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-25 06:50:38 +09:00
Lukas Czerner 93bb41f4f8 fs: Do not dispatch FITRIM through separate super_operation
There was concern that FITRIM ioctl is not common enough to be included
in core vfs ioctl, as Christoph Hellwig pointed out there's no real point
in dispatching this out to a separate vector instead of just through
->ioctl.

So this commit removes ioctl_fstrim() from vfs ioctl and trim_fs
from super_operation structure.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-19 21:18:35 -05:00
Tejun Heo d4d7762995 block: clean up blkdev_get() wrappers and their users
After recent blkdev_get() modifications, open_by_devnum() and
open_bdev_exclusive() are simple wrappers around blkdev_get().
Replace them with blkdev_get_by_dev() and blkdev_get_by_path().

blkdev_get_by_dev() is identical to open_by_devnum().
blkdev_get_by_path() is slightly different in that it doesn't
automatically add %FMODE_EXCL to @mode.

All users are converted.  Most conversions are mechanical and don't
introduce any behavior difference.  There are several exceptions.

* btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
  reason to OR it explicitly on blkdev_put().

* gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
  sb->s_mode.

* With the above changes, sb->s_mode now always should contain
  FMODE_EXCL.  WARN_ON_ONCE() added to kill_block_super() to detect
  errors.

The new blkdev_get_*() functions are with proper docbook comments.
While at it, add function description to blkdev_get() too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Joern Engel <joern@lazybastard.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: reiserfs-devel@vger.kernel.org
Cc: xfs-masters@oss.sgi.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2010-11-13 11:55:18 +01:00
Tejun Heo e525fd89d3 block: make blkdev_get/put() handle exclusive access
Over time, block layer has accumulated a set of APIs dealing with bdev
open, close, claim and release.

* blkdev_get/put() are the primary open and close functions.

* bd_claim/release() deal with exclusive open.

* open/close_bdev_exclusive() are combination of open and claim and
  the other way around, respectively.

* bd_link/unlink_disk_holder() to create and remove holder/slave
  symlinks.

* open_by_devnum() wraps bdget() + blkdev_get().

The interface is a bit confusing and the decoupling of open and claim
makes it impossible to properly guarantee exclusive access as
in-kernel open + claim sequence can disturb the existing exclusive
open even before the block layer knows the current open if for another
exclusive access.  Reorganize the interface such that,

* blkdev_get() is extended to include exclusive access management.
  @holder argument is added and, if is @FMODE_EXCL specified, it will
  gain exclusive access atomically w.r.t. other exclusive accesses.

* blkdev_put() is similarly extended.  It now takes @mode argument and
  if @FMODE_EXCL is set, it releases an exclusive access.  Also, when
  the last exclusive claim is released, the holder/slave symlinks are
  removed automatically.

* bd_claim/release() and close_bdev_exclusive() are no longer
  necessary and either made static or removed.

* bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
  is no longer necessary and removed.

* open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
  and blkdev_get().  It also has an unexpected extra bdev_read_only()
  test which probably should be moved into blkdev_get().

* open_by_devnum() is modified to take @holder argument and pass it to
  blkdev_get().

Most of bdev open/close operations are unified into blkdev_get/put()
and most exclusive accesses are tested atomically at the open time (as
it should).  This cleans up code and removes some, both valid and
invalid, but unnecessary all the same, corner cases.

open_bdev_exclusive() and open_by_devnum() can use further cleanup -
rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
special features.  Well, let's leave them for another day.

Most conversions are straight-forward.  drbd conversion is a bit more
involved as there was some reordering, but the logic should stay the
same.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: dm-devel@redhat.com
Cc: drbd-dev@lists.linbit.com
Cc: Leo Chen <leochen@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Joern Engel <joern@logfs.org>
Cc: reiserfs-devel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2010-11-13 11:55:17 +01:00
Tejun Heo e09b457bdb block: simplify holder symlink handling
Code to manage symlinks in /sys/block/*/{holders|slaves} are overly
complex with multiple holder considerations, redundant extra
references to all involved kobjects, unused generic kobject holder
support and unnecessary mixup with bd_claim/release functionalities.

Strip it down to what's necessary (single gendisk holder) and make it
use a separate interface.  This is a step for cleaning up
bd_claim/release.  This patch makes dm-table slightly more complex but
it will be simplified again with further changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
2010-11-13 11:55:17 +01:00
Christoph Hellwig bb8430a2c8 locks: remove fl_copy_lock lock_manager operation
This one was only used for a nasty hack in nfsd, which has recently
been removed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-31 06:35:15 -07:00
J. Bruce Fields 05fa3135fd locks: fix setlease methods to free passed-in lock
We modified setlease to require the caller to allocate the new lease in
the case of creating a new lease, but forgot to fix up the filesystem
methods.

Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-30 18:08:15 -07:00
Linus Torvalds 435f49a518 readv/writev: do the same MAX_RW_COUNT truncation that read/write does
We used to protect against overflow, but rather than return an error, do
what read/write does, namely to limit the total size to MAX_RW_COUNT.
This is not only more consistent, but it also means that any broken
low-level read/write routine that still keeps counts in 'int' can't
break.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-29 10:36:49 -07:00
Al Viro ceefda6931 switch get_sb_ns() users
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:17:03 -04:00
Al Viro 51139adac9 convert get_sb_pseudo() users
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:16:33 -04:00
Al Viro 3c26ff6e49 convert get_sb_nodev() users
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:16:31 -04:00
Al Viro fc14f2fef6 convert get_sb_single() users
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:16:28 -04:00
Al Viro 152a083666 new helper: mount_bdev()
... and switch of the obvious get_sb_bdev() users to ->mount()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:16:13 -04:00
Al Viro c96e41e92b beginning of transtion: ->mount()
eventual replacement for ->get_sb() - does *not* get vfsmount,
return ERR_PTR(error) or root of subtree to be mounted.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:15:06 -04:00
Ingo Molnar b31d42a5af Fix compile brekage with !CONFIG_BLOCK
Today's git tree fails to build on !CONFIG_BLOCK, due to upstream commit
367a51a339 ("fs: Add FITRIM ioctl"):

 include/linux/fs.h:36: error: expected specifier-qualifier-list before ‘uint64_t’
 include/linux/fs.h:36: error: expected specifier-qualifier-list before ‘uint64_t’
 include/linux/fs.h:36: error: expected specifier-qualifier-list before ‘uint64_t’

The commit adds uint64_t type usage to fs.h, but linux/types.h is not included
explicitly - it's only included implicitly via linux/blk_types.h, and there only if
CONFIG_BLOCK is enabled.

Add the explicit #include to fix this.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-28 09:02:15 -07:00
Theodore Ts'o a107e5a3a4 Merge branch 'next' into upstream-merge
Conflicts:
	fs/ext4/inode.c
	fs/ext4/mballoc.c
	include/trace/events/ext4.h
2010-10-27 23:44:47 -04:00
Lukas Czerner 367a51a339 fs: Add FITRIM ioctl
Adds an filesystem independent ioctl to allow implementation of file
system batched discard support. I takes fstrim_range structure as an
argument. fstrim_range is definec in the include/fs.h and its
definition is as follows.

struct fstrim_range {
	start;
	len;
	minlen;
}

start	- first Byte to trim
len	- number of Bytes to trim from start
minlen	- minimum extent length to trim, free extents shorter than this
	  number of Bytes will be ignored. This will be rounded up to fs
	  block size.

It is also possible to specify NULL as an argument. In this case the
arguments will set itself as follows:

start = 0;
len = ULLONG_MAX;
minlen = 0;

So it will trim the whole file system at one run.

After the FITRIM is done, the number of actually discarded Bytes is stored
in fstrim_range.len to give the user better insight on how much storage
space has been really released for wear-leveling.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:11 -04:00
Linus Torvalds 7420a8c0de Merge branch 'flock' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl
* 'flock' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
  locks: turn lock_flocks into a spinlock
  fasync: re-organize fasync entry insertion to allow it under a spinlock
  locks/nfsd: allocate file lock outside of spinlock
  lockd: fix nlmsvc_notify_blocked locking
  lockd: push lock_flocks down
2010-10-27 18:13:34 -07:00
Linus Torvalds f7347ce4ee fasync: re-organize fasync entry insertion to allow it under a spinlock
You currently cannot use "fasync_helper()" in an atomic environment to
insert a new fasync entry, because it will need to allocate the new
"struct fasync_struct".

Yet fcntl_setlease() wants to call this under lock_flocks(), which is in
the process of being converted from the BKL to a spinlock.

In order to fix this, this abstracts out the actual fasync list
insertion and the fasync allocations into functions of their own, and
teaches fs/locks.c to pre-allocate the fasync_struct entry.  That way
the actual list insertion can happen while holding the required
spinlock.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bfields@redhat.com: rebase on top of my changes to Arnd's patch]
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-27 22:06:17 +02:00
Arnd Bergmann c5b1f0d92c locks/nfsd: allocate file lock outside of spinlock
As suggested by Christoph Hellwig, this moves allocation
of new file locks out of generic_setlease into the
callers, nfs4_open_delegation and fcntl_setlease in order
to allow GFP_KERNEL allocations when lock_flocks has
become a spinlock.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: J. Bruce Fields <bfields@redhat.com>
2010-10-27 21:41:50 +02:00
Linus Torvalds 426e1f5cec Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
  split invalidate_inodes()
  fs: skip I_FREEING inodes in writeback_sb_inodes
  fs: fold invalidate_list into invalidate_inodes
  fs: do not drop inode_lock in dispose_list
  fs: inode split IO and LRU lists
  fs: switch bdev inode bdi's correctly
  fs: fix buffer invalidation in invalidate_list
  fsnotify: use dget_parent
  smbfs: use dget_parent
  exportfs: use dget_parent
  fs: use RCU read side protection in d_validate
  fs: clean up dentry lru modification
  fs: split __shrink_dcache_sb
  fs: improve DCACHE_REFERENCED usage
  fs: use percpu counter for nr_dentry and nr_dentry_unused
  fs: simplify __d_free
  fs: take dcache_lock inside __d_path
  fs: do not assign default i_ino in new_inode
  fs: introduce a per-cpu last_ino allocator
  new helper: ihold()
  ...
2010-10-26 17:58:44 -07:00
Linus Torvalds 31453a9764 Merge branch 'akpm-incoming-1'
* akpm-incoming-1: (176 commits)
  scripts/checkpatch.pl: add check for declaration of pci_device_id
  scripts/checkpatch.pl: add warnings for static char that could be static const char
  checkpatch: version 0.31
  checkpatch: statement/block context analyser should look at sanitised lines
  checkpatch: handle EXPORT_SYMBOL for DEVICE_ATTR and similar
  checkpatch: clean up structure definition macro handline
  checkpatch: update copyright dates
  checkpatch: Add additional attribute #defines
  checkpatch: check for incorrect permissions
  checkpatch: ensure kconfig help checks only apply when we are adding help
  checkpatch: simplify and consolidate "missing space after" checks
  checkpatch: add check for space after struct, union, and enum
  checkpatch: returning errno typically should be negative
  checkpatch: handle casts better fixing false categorisation of : as binary
  checkpatch: ensure we do not collapse bracketed sections into constants
  checkpatch: suggest cleanpatch and cleanfile when appropriate
  checkpatch: types may sit on a line on their own
  checkpatch: fix regressions in "fix handling of leading spaces"
  div64_u64(): improve precision on 32bit platforms
  lib/parser: cleanup match_number()
  ...
2010-10-26 17:15:20 -07:00
Eric Dumazet 518de9b39e fs: allow for more than 2^31 files
Robin Holt tried to boot a 16TB system and found af_unix was overflowing
a 32bit value :

<quote>

We were seeing a failure which prevented boot.  The kernel was incapable
of creating either a named pipe or unix domain socket.  This comes down
to a common kernel function called unix_create1() which does:

        atomic_inc(&unix_nr_socks);
        if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
                goto out;

The function get_max_files() is a simple return of files_stat.max_files.
files_stat.max_files is a signed integer and is computed in
fs/file_table.c's files_init().

        n = (mempages * (PAGE_SIZE / 1024)) / 10;
        files_stat.max_files = n;

In our case, mempages (total_ram_pages) is approx 3,758,096,384
(0xe0000000).  That leaves max_files at approximately 1,503,238,553.
This causes 2 * get_max_files() to integer overflow.

</quote>

Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
integers, and change af_unix to use an atomic_long_t instead of atomic_t.

get_max_files() is changed to return an unsigned long.  get_nr_files() is
changed to return a long.

unix_nr_socks is changed from atomic_t to atomic_long_t, while not
strictly needed to address Robin problem.

Before patch (on a 64bit kernel) :
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
-18446744071562067968

After patch:
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
2147483648
# cat /proc/sys/fs/file-nr
704     0       2147483648

Reported-by: Robin Holt <holt@sgi.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David Miller <davem@davemloft.net>
Reviewed-by: Robin Holt <holt@sgi.com>
Tested-by: Robin Holt <holt@sgi.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:15 -07:00
Linus Torvalds f9ba5375a8 Merge branch 'ima-memory-use-fixes'
* ima-memory-use-fixes:
  IMA: fix the ToMToU logic
  IMA: explicit IMA i_flag to remove global lock on inode_delete
  IMA: drop refcnt from ima_iint_cache since it isn't needed
  IMA: only allocate iint when needed
  IMA: move read counter into struct inode
  IMA: use i_writecount rather than a private counter
  IMA: use inode->i_lock to protect read and write counters
  IMA: convert internal flags from long to char
  IMA: use unsigned int instead of long for counters
  IMA: drop the inode opencount since it isn't needed for operation
  IMA: use rbtree instead of radix tree for inode information cache
2010-10-26 11:37:48 -07:00
Eric Paris 196f518128 IMA: explicit IMA i_flag to remove global lock on inode_delete
Currently for every removed inode IMA must take a global lock and search
the IMA rbtree looking for an associated integrity structure.  Instead
we explicitly mark an inode when we add an integrity structure so we
only have to take the global lock and do the removal if it exists.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 11:37:19 -07:00
Eric Paris a178d2027d IMA: move read counter into struct inode
IMA currently allocated an inode integrity structure for every inode in
core.  This stucture is about 120 bytes long.  Most files however
(especially on a system which doesn't make use of IMA) will never need
any of this space.  The problem is that if IMA is enabled we need to
know information about the number of readers and the number of writers
for every inode on the box.  At the moment we collect that information
in the per inode iint structure and waste the rest of the space.  This
patch moves those counters into the struct inode so we can eventually
stop allocating an IMA integrity structure except when absolutely
needed.

This patch does the minimum needed to move the location of the data.
Further cleanups, especially the location of counter updates, may still
be possible.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 11:37:18 -07:00
Nick Piggin 7ccf19a804 fs: inode split IO and LRU lists
The use of the same inode list structure (inode->i_list) for two
different list constructs with different lifecycles and purposes
makes it impossible to separate the locking of the different
operations. Therefore, to enable the separation of the locking of
the writeback and reclaim lists, split the inode->i_list into two
separate lists dedicated to their specific tracking functions.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:15 -04:00
Christoph Hellwig 312d3ca856 fs: use percpu counter for nr_dentry and nr_dentry_unused
The nr_dentry stat is a globally touched cacheline and atomic operation
twice over the lifetime of a dentry. It is used for the benfit of userspace
only. Turn it into a per-cpu counter and always decrement it in d_free instead
of doing various batching operations to reduce lock hold times in the callers.

Based on an earlier patch from Nick Piggin <npiggin@suse.de>.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:12 -04:00
Christoph Hellwig 85fe4025c6 fs: do not assign default i_ino in new_inode
Instead of always assigning an increasing inode number in new_inode
move the call to assign it into those callers that actually need it.
For now callers that need it is estimated conservatively, that is
the call is added to all filesystems that do not assign an i_ino
by themselves.  For a few more filesystems we can avoid assigning
any inode number given that they aren't user visible, and for others
it could be done lazily when an inode number is actually needed,
but that's left for later patches.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:11 -04:00
Al Viro 7de9c6ee3e new helper: ihold()
Clones an existing reference to inode; caller must already hold one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:11 -04:00
Christoph Hellwig 646ec4615c fs: remove inode_add_to_list/__inode_add_to_list
Split up inode_add_to_list/__inode_add_to_list.  Locking for the two
lists will be split soon so these helpers really don't buy us much
anymore.

The __ prefixes for the sb list helpers will go away soon, but until
inode_lock is gone we'll need them to distinguish between the locked
and unlocked variants.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:10 -04:00
Nick Piggin 9e38d86ff2 fs: Implement lazy LRU updates for inodes
Convert the inode LRU to use lazy updates to reduce lock and
cacheline traffic.  We avoid moving inodes around in the LRU list
during iget/iput operations so these frequent operations don't need
to access the LRUs. Instead, we defer the refcount checks to
reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
reclaim that iget has touched the inode in the past. This means that
only reclaim should be touching the LRU with any frequency, hence
significantly reducing lock acquisitions and the amount contention
on LRU updates.

This also removes the inode_in_use list, which means we now only
have one list for tracking the inode LRU status. This makes it much
simpler to split out the LRU list operations under it's own lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:09 -04:00
Dave Chinner cffbc8aa33 fs: Convert nr_inodes and nr_unused to per-cpu counters
The number of inodes allocated does not need to be tied to the
addition or removal of an inode to/from a list. If we are not tied
to a list lock, we could update the counters when inodes are
initialised or destroyed, but to do that we need to convert the
counters to be per-cpu (i.e. independent of a lock). This means that
we have the freedom to change the list/locking implementation
without needing to care about the counters.

Based on a patch originally from Eric Dumazet.

[AV: cleaned up a bit, fixed build breakage on weird configs

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:09 -04:00
Al Viro 1d3382cbf0 new helper: inode_unhashed()
note: for race-free uses you inode_lock held

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:24:15 -04:00
Al Viro a8dade34e3 unexport invalidate_inodes
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:23:32 -04:00
KAMEZAWA Hiroyuki 4a3956c790 vfs: introduce FMODE_UNSIGNED_OFFSET for allowing negative f_pos
Now, rw_verify_area() checsk f_pos is negative or not.  And if negative,
returns -EINVAL.

But, some special files as /dev/(k)mem and /proc/<pid>/mem etc..  has
negative offsets.  And we can't do any access via read/write to the
file(device).

So introduce FMODE_UNSIGNED_OFFSET to allow negative file offsets.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:18:21 -04:00
Eric Dumazet 7e360c38ab fs: allow for more than 2^31 files
Andrew,

Could you please review this patch, you probably are the right guy to
take it, because it crosses fs and net trees.

Note : /proc/sys/fs/file-nr is a read-only file, so this patch doesnt
depend on previous patch (sysctl: fix min/max handling in
__do_proc_doulongvec_minmax())

Thanks !

[PATCH V4] fs: allow for more than 2^31 files

Robin Holt tried to boot a 16TB system and found af_unix was overflowing
a 32bit value :

<quote>

We were seeing a failure which prevented boot.  The kernel was incapable
of creating either a named pipe or unix domain socket.  This comes down
to a common kernel function called unix_create1() which does:

        atomic_inc(&unix_nr_socks);
        if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
                goto out;

The function get_max_files() is a simple return of files_stat.max_files.
files_stat.max_files is a signed integer and is computed in
fs/file_table.c's files_init().

        n = (mempages * (PAGE_SIZE / 1024)) / 10;
        files_stat.max_files = n;

In our case, mempages (total_ram_pages) is approx 3,758,096,384
(0xe0000000).  That leaves max_files at approximately 1,503,238,553.
This causes 2 * get_max_files() to integer overflow.

</quote>

Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
integers, and change af_unix to use an atomic_long_t instead of
atomic_t.

get_max_files() is changed to return an unsigned long.
get_nr_files() is changed to return a long.

unix_nr_socks is changed from atomic_t to atomic_long_t, while not
strictly needed to address Robin problem.

Before patch (on a 64bit kernel) :
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
-18446744071562067968

After patch:
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
2147483648
# cat /proc/sys/fs/file-nr
704     0       2147483648

Reported-by: Robin Holt <holt@sgi.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David Miller <davem@davemloft.net>
Reviewed-by: Robin Holt <holt@sgi.com>
Tested-by: Robin Holt <holt@sgi.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:18:20 -04:00
Christoph Hellwig 56b0dacfa2 fs: mark destroy_inode static
Hugetlbfs used to need it, but after the destroy_inode and evict_inode
changes it's not required anymore.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:18:19 -04:00
Christoph Hellwig c37650161a fs: add sync_inode_metadata
Add a new helper to write out the inode using the writeback code,
that is including the correct dirty bit and list manipulation.  A few
of filesystems already opencode this, and a lot of others should be
using it instead of using write_inode_now which also writes out the
data.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:18:19 -04:00
Linus Torvalds a2887097f2 Merge branch 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits)
  xen-blkfront: disable barrier/flush write support
  Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c
  block: remove BLKDEV_IFL_WAIT
  aic7xxx_old: removed unused 'req' variable
  block: remove the BH_Eopnotsupp flag
  block: remove the BLKDEV_IFL_BARRIER flag
  block: remove the WRITE_BARRIER flag
  swap: do not send discards as barriers
  fat: do not send discards as barriers
  ext4: do not send discards as barriers
  jbd2: replace barriers with explicit flush / FUA usage
  jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier
  jbd: replace barriers with explicit flush / FUA usage
  nilfs2: replace barriers with explicit flush / FUA usage
  reiserfs: replace barriers with explicit flush / FUA usage
  gfs2: replace barriers with explicit flush / FUA usage
  btrfs: replace barriers with explicit flush / FUA usage
  xfs: replace barriers with explicit flush / FUA usage
  block: pass gfp_mask and flags to sb_issue_discard
  dm: convey that all flushes are processed as empty
  ...
2010-10-22 17:07:18 -07:00
Linus Torvalds 092e0e7e52 Merge branch 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl
* 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
  vfs: make no_llseek the default
  vfs: don't use BKL in default_llseek
  llseek: automatically add .llseek fop
  libfs: use generic_file_llseek for simple_attr
  mac80211: disallow seeks in minstrel debug code
  lirc: make chardev nonseekable
  viotape: use noop_llseek
  raw: use explicit llseek file operations
  ibmasmfs: use generic_file_llseek
  spufs: use llseek in all file operations
  arm/omap: use generic_file_llseek in iommu_debug
  lkdtm: use generic_file_llseek in debugfs
  net/wireless: use generic_file_llseek in debugfs
  drm: use noop_llseek
2010-10-22 10:52:56 -07:00
Linus Torvalds 79f14b7c56 Merge branch 'vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl
* 'vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl: (30 commits)
  BKL: remove BKL from freevxfs
  BKL: remove BKL from qnx4
  autofs4: Only declare function when CONFIG_COMPAT is defined
  autofs: Only declare function when CONFIG_COMPAT is defined
  ncpfs: Lock socket in ncpfs while setting its callbacks
  fs/locks.c: prepare for BKL removal
  BKL: Remove BKL from ncpfs
  BKL: Remove BKL from OCFS2
  BKL: Remove BKL from squashfs
  BKL: Remove BKL from jffs2
  BKL: Remove BKL from ecryptfs
  BKL: Remove BKL from afs
  BKL: Remove BKL from USB gadgetfs
  BKL: Remove BKL from autofs4
  BKL: Remove BKL from isofs
  BKL: Remove BKL from fat
  BKL: Remove BKL from ext2 filesystem
  BKL: Remove BKL from do_new_mount()
  BKL: Remove BKL from cgroup
  BKL: Remove BKL from NTFS
  ...
2010-10-22 10:52:01 -07:00
Linus Torvalds f3270b16e0 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (48 commits)
  ocfs2: Avoid to evaluate xattr block flags again.
  ocfs2/cluster: Release debugfs file elapsed_time_in_ms
  ocfs2: Add a mount option "coherency=*" to handle cluster coherency for O_DIRECT writes.
  Initialize max_slots early
  When I tried to compile I got the following warning: fs/ocfs2/slot_map.c: In function ‘ocfs2_init_slot_info’: fs/ocfs2/slot_map.c:360: warning: ‘bytes’ may be used uninitialized in this function fs/ocfs2/slot_map.c:360: note: ‘bytes’ was declared here Compiler: gcc version 4.4.3 (GCC) on Mandriva I'm not sure why this warning occurs, I think compiler don't know that variable "bytes" is initialized when it is sent by reference to ocfs2_slot_map_physical_size and it throws that ugly warning. However, a simple initialization of "bytes" variable with 0 will fix it.
  ocfs2: validate bg_free_bits_count after update
  ocfs2/cluster: Bump up dlm protocol to version 1.1
  ocfs2/cluster: Show per region heartbeat elapsed time
  ocfs2/cluster: Add mlogs for heartbeat up/down events
  ocfs2/cluster: Create debugfs dir/files for each region
  ocfs2/cluster: Create debugfs files for live, quorum and failed region bitmaps
  ocfs2/cluster: Maintain bitmap of failed regions
  ocfs2/cluster: Maintain bitmap of quorum regions
  ocfs2/cluster: Track bitmap of live heartbeat regions
  ocfs2/cluster: Track number of global heartbeat regions
  ocfs2/cluster: Maintain live node bitmap per heartbeat region
  ocfs2/cluster: Reorganize o2hb debugfs init
  ocfs2/cluster: Check slots for unconfigured live nodes
  ocfs2/cluster: Print messages when adding/removing nodes
  ocfs2/cluster: Print messages when adding/removing heartbeat regions
  ...
2010-10-21 19:01:34 -07:00
Jens Axboe fa251f8990 Merge branch 'v2.6.36-rc8' into for-2.6.37/barrier
Conflicts:
	block/blk-core.c
	drivers/block/loop.c
	mm/swapfile.c

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-19 09:13:04 +02:00
Joel Becker fc3718918f Merge branch 'globalheartbeat-2' of git://oss.oracle.com/git/smushran/linux-2.6 into ocfs2-merge-window
Conflicts:
	fs/ocfs2/ocfs2.h
2010-10-15 13:03:09 -07:00
Ingo Molnar 556ef63255 Merge commit 'v2.6.36-rc7' into core/rcu
Merge reason: Update from -rc3 to -rc7.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-07 09:43:45 +02:00
Arnd Bergmann b89f432133 fs/locks.c: prepare for BKL removal
This prepares the removal of the big kernel lock from the
file locking code. We still use the BKL as long as fs/lockd
uses it and ceph might sleep, but we can flip the definition
to a private spinlock as soon as that's done.
All users outside of fs/lockd get converted to use
lock_flocks() instead of lock_kernel() where appropriate.

Based on an earlier patch to use a spinlock from Matthew
Wilcox, who has attempted this a few times before, the
earliest patch from over 10 years ago turned it into
a semaphore, which ended up being slower than the BKL
and was subsequently reverted.

Someone should do some serious performance testing when
this becomes a spinlock, since this has caused problems
before. Using a spinlock should be at least as good
as the BKL in theory, but who knows...

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Sage Weil <sage@newdream.net>
Cc: linux-kernel@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
2010-10-05 11:02:04 +02:00
Sage Weil 8b15575cae fs: {lock,unlock}_flocks() stubs to prepare for BKL removal
The lock structs are currently protected by the BKL, but are accessed by
code in fs/locks.c and misc file system and DLM code.  These stubs will
allow all users to switch to the new interface before the implementation
is changed to a spinlock.

Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-21 17:27:44 -07:00
Arnd Bergmann 1ec5584e3e libfs: use generic_file_llseek for simple_attr
Simple attribute files need to be seekable to
allow resetting the file for another read.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-09-16 10:33:19 +02:00
Patrick J. LoPresti 30ca22c70e ext3/ext4: Factor out disk addressability check
As part of adding support for OCFS2 to mount huge volumes, we need to
check that the sector_t and page cache of the system are capable of
addressing the entire volume.

An identical check already appears in ext3 and ext4.  This patch moves
the addressability check into its own function in fs/libfs.c and
modifies ext3 and ext4 to invoke it.

[Edited to -EINVAL instead of BUG_ON() for bad blocksize_bits -- Joel]

Signed-off-by: Patrick LoPresti <lopresti@gmail.com>
Cc: linux-ext4@vger.kernel.org
Acked-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-09-10 08:41:42 -07:00
Christoph Hellwig 8c55536782 block: remove the BLKDEV_IFL_BARRIER flag
Remove support for barriers on discards, which is unused now.  Also
remove the DISCARD_NOBARRIER I/O type in favour of just setting the
rw flags up locally in blkdev_issue_discard.

tj: Also remove DISCARD_SECURE and use REQ_SECURE directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-10 12:35:40 +02:00
Christoph Hellwig 31725e65c7 block: remove the WRITE_BARRIER flag
It's unused now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-10 12:35:40 +02:00
Tejun Heo 4fed947cb3 block: implement REQ_FLUSH/FUA based interface for FLUSH/FUA requests
Now that the backend conversion is complete, export sequenced
FLUSH/FUA capability through REQ_FLUSH/FUA flags.  REQ_FLUSH means the
device cache should be flushed before executing the request.  REQ_FUA
means that the data in the request should be on non-volatile media on
completion.

Block layer will choose the correct way of implementing the semantics
and execute it.  The request may be passed to the device directly if
the device can handle it; otherwise, it will be sequenced using one or
more proxy requests.  Devices will never see REQ_FLUSH and/or FUA
which it doesn't support.

Also, unlike the original REQ_HARDBARRIER, REQ_FLUSH/FUA requests are
never failed with -EOPNOTSUPP.  If the underlying device doesn't
support FLUSH/FUA, the block layer simply make those noop.  IOW, it no
longer distinguishes between writeback cache which doesn't support
cache flush and writethrough/no cache.  Devices which have WB cache
w/o flush are very difficult to come by these days and there's nothing
much we can do anyway, so it doesn't make sense to require everyone to
implement -EOPNOTSUPP handling.  This will simplify filesystems and
block drivers as they can drop -EOPNOTSUPP retry logic for barriers.

* QUEUE_ORDERED_* are removed and QUEUE_FSEQ_* are moved into
  blk-flush.c.

* REQ_FLUSH w/o data can also be directly passed to drivers without
  sequencing but some drivers assume that zero length requests don't
  have rq->bio which isn't true for these requests requiring the use
  of proxy requests.

* REQ_COMMON_MASK now includes REQ_FLUSH | REQ_FUA so that they are
  copied from bio to request.

* WRITE_BARRIER is marked deprecated and WRITE_FLUSH, WRITE_FUA and
  WRITE_FLUSH_FUA are added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-10 12:35:37 +02:00
Arnd Bergmann 4d2deb40b2 kernel: __rcu annotations
This adds annotations for RCU operations in core kernel components

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2010-08-19 17:18:03 -07:00
Nick Piggin 6416ccb789 fs: scale files_lock
fs: scale files_lock

Improve scalability of files_lock by adding per-cpu, per-sb files lists,
protected with an lglock. The lglock provides fast access to the per-cpu lists
to add and remove files. It also provides a snapshot of all the per-cpu lists
(although this is very slow).

One difficulty with this approach is that a file can be removed from the list
by another CPU. We must track which per-cpu list the file is on with a new
variale in the file struct (packed into a hole on 64-bit archs). Scalability
could suffer if files are frequently removed from different cpu's list.

However loads with frequent removal of files imply short interval between
adding and removing the files, and the scheduler attempts to avoid moving
processes too far away. Also, even in the case of cross-CPU removal, the
hardware has much more opportunity to parallelise cacheline transfers with N
cachelines than with 1.

A worst-case test of 1 CPU allocating files subsequently being freed by N CPUs
degenerates to contending on a single lock, which is no worse than before. When
more than one CPU are allocating files, even if they are always freed by
different CPUs, there will be more parallelism than the single-lock case.

Testing results:

On a 2 socket, 8 core opteron, I measure the number of times the lock is taken
to remove the file, the number of times it is removed by the same CPU that
added it, and the number of times it is removed by the same node that added it.

Booting:    locks=  25049 cpu-hits=  23174 (92.5%) node-hits=  23945 (95.6%)
kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%)
dbench 64   locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%)

So a file is removed from the same CPU it was added by over 90% of the time.
It remains within the same node 95% of the time.

Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.

                throughput
2.6.34-rc2      24.5
+patch          24.9

                us      sys     idle    IO wait (in %)
2.6.34-rc2      51.25   28.25   17.25   3.25
+patch          53.75   18.5    19      8.75

So significantly less CPU time spent in kernel code, higher idle time and
slightly higher throughput.

Single threaded performance difference was within the noise of microbenchmarks.
That is not to say penalty does not exist, the code is larger and more memory
accesses required so it will be slightly slower.

Cc: linux-kernel@vger.kernel.org
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18 08:35:48 -04:00
Nick Piggin d996b62a8d tty: fix fu_list abuse
tty: fix fu_list abuse

tty code abuses fu_list, which causes a bug in remount,ro handling.

If a tty device node is opened on a filesystem, then the last link to the inode
removed, the filesystem will be allowed to be remounted readonly. This is
because fs_may_remount_ro does not find the 0 link tty inode on the file sb
list (because the tty code incorrectly removed it to use for its own purpose).
This can result in a filesystem with errors after it is marked "clean".

Taking idea from Christoph's initial patch, allocate a tty private struct
at file->private_data and put our required list fields in there, linking
file and tty. This makes tty nodes behave the same way as other device nodes
and avoid meddling with the vfs, and avoids this bug.

The error handling is not trivial in the tty code, so for this bugfix, I take
the simple approach of using __GFP_NOFAIL and don't worry about memory errors.
This is not a problem because our allocator doesn't fail small allocs as a rule
anyway. So proper error handling is left as an exercise for tty hackers.

[ Arguably filesystem's device inode would ideally be divorced from the
driver's pseudo inode when it is opened, but in practice it's not clear whether
that will ever be worth implementing. ]

Cc: linux-kernel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18 08:35:47 -04:00
Nick Piggin ee2ffa0dfd fs: cleanup files_lock locking
fs: cleanup files_lock locking

Lock tty_files with a new spinlock, tty_files_lock; provide helpers to
manipulate the per-sb files list; unexport the files_lock spinlock.

Cc: linux-kernel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18 08:35:47 -04:00
Christoph Hellwig 9cb569d601 remove SWRITE* I/O types
These flags aren't real I/O types, but tell ll_rw_block to always
lock the buffer instead of giving up on a failed trylock.

Instead add a new write_dirty_buffer helper that implements this semantic
and use it from the existing SWRITE* callers.  Note that the ll_rw_block
code had a bug where it didn't promote WRITE_SYNC_PLUG properly, which
this patch fixes.

In the ufs code clean up the helper that used to call ll_rw_block
to mirror sync_dirty_buffer, which is the function it implements for
compound buffers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18 01:09:01 -04:00
Linus Torvalds 10041d2d14 Merge branch 'bkl/ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing
* 'bkl/ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
  bkl: Remove locked .ioctl file operation
  v4l: Remove reference to bkl ioctl in compat ioctl handling
  logfs: kill BKL
2010-08-13 17:52:35 -07:00
David Howells c788732523 Mark arguments to certain syscalls as being const
Mark arguments to certain system calls as being const where they should be but
aren't.  The list includes:

 (*) The filename arguments of various stat syscalls, execve(), various utimes
     syscalls and some mount syscalls.

 (*) The filename arguments of some syscall helpers relating to the above.

 (*) The buffer argument of various write syscalls.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-13 16:53:13 -07:00
Arnd Bergmann b19dd42faf bkl: Remove locked .ioctl file operation
The last user is gone, so we can safely remove this

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: John Kacur <jkacur@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-08-14 00:24:24 +02:00
Adrian Hunter 8d57a98ccd block: add secure discard
Secure discard is the same as discard except that all copies of the
discarded sectors (perhaps created by garbage collection) must also be
erased.

Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Kyungmin Park <kmpark@infradead.org>
Cc: Madhusudhan Chikkature <madhu.cr@ti.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ben Gardiner <bengardiner@nanometrics.ca>
Cc: <linux-mmc@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-12 08:43:30 -07:00
Andrew Morton 13bcbc0087 include/linux/fs.h: complete hexification of FMODE_* constants
One straggler which was missed due to merge ordering issues.

Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-11 08:58:59 -07:00
Linus Torvalds 2f9e825d3e Merge branch 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits)
  block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n
  xen-blkfront: fix missing out label
  blkdev: fix blkdev_issue_zeroout return value
  block: update request stacking methods to support discards
  block: fix missing export of blk_types.h
  writeback: fix bad _bh spinlock nesting
  drbd: revert "delay probes", feature is being re-implemented differently
  drbd: Initialize all members of sync_conf to their defaults [Bugz 315]
  drbd: Disable delay probes for the upcomming release
  writeback: cleanup bdi_register
  writeback: add new tracepoints
  writeback: remove unnecessary init_timer call
  writeback: optimize periodic bdi thread wakeups
  writeback: prevent unnecessary bdi threads wakeups
  writeback: move bdi threads exiting logic to the forker thread
  writeback: restructure bdi forker loop a little
  writeback: move last_active to bdi
  writeback: do not remove bdi from bdi_list
  writeback: simplify bdi code a little
  writeback: do not lose wake-ups in bdi threads
  ...

Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and
drivers/scsi/scsi_error.c as per Jens.
2010-08-10 15:22:42 -07:00
Linus Torvalds 8c8946f509 Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify
* 'for-linus' of git://git.infradead.org/users/eparis/notify: (132 commits)
  fanotify: use both marks when possible
  fsnotify: pass both the vfsmount mark and inode mark
  fsnotify: walk the inode and vfsmount lists simultaneously
  fsnotify: rework ignored mark flushing
  fsnotify: remove global fsnotify groups lists
  fsnotify: remove group->mask
  fsnotify: remove the global masks
  fsnotify: cleanup should_send_event
  fanotify: use the mark in handler functions
  audit: use the mark in handler functions
  dnotify: use the mark in handler functions
  inotify: use the mark in handler functions
  fsnotify: send fsnotify_mark to groups in event handling functions
  fsnotify: Exchange list heads instead of moving elements
  fsnotify: srcu to protect read side of inode and vfsmount locks
  fsnotify: use an explicit flag to indicate fsnotify_destroy_mark has been called
  fsnotify: use _rcu functions for mark list traversal
  fsnotify: place marks on object in order of group memory address
  vfs/fsnotify: fsnotify_close can delay the final work in fput
  fsnotify: store struct file not struct path
  ...

Fix up trivial delete/modify conflict in fs/notify/inotify/inotify.c.
2010-08-10 11:39:13 -07:00
Linus Torvalds 5f248c9c25 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
  no need for list_for_each_entry_safe()/resetting with superblock list
  Fix sget() race with failing mount
  vfs: don't hold s_umount over close_bdev_exclusive() call
  sysv: do not mark superblock dirty on remount
  sysv: do not mark superblock dirty on mount
  btrfs: remove junk sb_dirt change
  BFS: clean up the superblock usage
  AFFS: wait for sb synchronization when needed
  AFFS: clean up dirty flag usage
  cifs: truncate fallout
  mbcache: fix shrinker function return value
  mbcache: Remove unused features
  add f_flags to struct statfs(64)
  pass a struct path to vfs_statfs
  update VFS documentation for method changes.
  All filesystems that need invalidate_inode_buffers() are doing that explicitly
  convert remaining ->clear_inode() to ->evict_inode()
  Make ->drop_inode() just return whether inode needs to be dropped
  fs/inode.c:clear_inode() is gone
  fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
  ...

Fix up trivial conflicts in fs/nilfs2/super.c
2010-08-10 11:26:52 -07:00
Jan Kara f446daaea9 mm: implement writeback livelock avoidance using page tagging
We try to avoid livelocks of writeback when some steadily creates dirty
pages in a mapping we are writing out.  For memory-cleaning writeback,
using nr_to_write works reasonably well but we cannot really use it for
data integrity writeback.  This patch tries to solve the problem.

The idea is simple: Tag all pages that should be written back with a
special tag (TOWRITE) in the radix tree.  This can be done rather quickly
and thus livelocks should not happen in practice.  Then we start doing the
hard work of locking pages and sending them to disk only for those pages
that have TOWRITE tag set.

Note: Adding new radix tree tag grows radix tree node from 288 to 296
bytes for 32-bit archs and from 552 to 560 bytes for 64-bit archs.
However, the number of slab/slub items per page remains the same (13 and 7
respectively).

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09 20:44:59 -07:00
Al Viro 7a4dec5389 Fix sget() race with failing mount
If sget() finds a matching superblock being set up, it'll
grab an active reference to it and grab s_umount.  That's
fine - we'll wait for completion of foofs_get_sb() that way.
However, if said foofs_get_sb() fails we'll end up holding
the halfway-created superblock.  deactivate_locked_super()
called by foofs_get_sb() will just unlock the sucker since
we are holding another active reference to it.

What we need is a way to tell if superblock has been successfully
set up.  Unfortunately, neither ->s_root nor the check for
MS_ACTIVE quite fit.  Cheap and easy way, suitable for backport:
new flag set by the (only) caller of ->get_sb().  If that flag
isn't present by the time sget() grabbed s_umount on preexisting
superblock it has found, it's seeing a stillborn and should
just bury it with deactivate_locked_super() (and repeat the search).

Longer term we want to set that flag in ->get_sb() instances (and
check for it to distinguish between "sget() found us a live sb"
and "sget() has allocated an sb, we need to set it up" in there,
instead of checking ->s_root as we do now).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org
2010-08-09 16:49:01 -04:00
Christoph Hellwig ebabe9a900 pass a struct path to vfs_statfs
We'll need the path to implement the flags field for statvfs support.
We do have it available in all callers except:

 - ecryptfs_statfs.  This one doesn't actually need vfs_statfs but just
   needs to do a caller to the lower filesystem statfs method.
 - sys_ustat.  Add a non-exported statfs_by_dentry helper for it which
   doesn't won't be able to fill out the flags field later on.

In addition rename the helpers for statfs vs fstatfs to do_*statfs instead
of the misleading vfs prefix.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:48:42 -04:00
Al Viro b57922d97f convert remaining ->clear_inode() to ->evict_inode()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:48:37 -04:00
Al Viro 45321ac543 Make ->drop_inode() just return whether inode needs to be dropped
... and let iput_final() do the actual eviction or retention

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:48:35 -04:00
Al Viro 30140837f2 fs/inode.c:clear_inode() is gone
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:48:34 -04:00
Al Viro 07958f9f5b ->delete_inode() is gone
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:48:31 -04:00
Al Viro b0683aa638 new helper: end_writeback()
Essentially, the minimal variant of ->evict_inode().  It's
a trimmed-down clear_inode(), sans any fs callbacks.  Once
it returns we know that no async writeback will be happening;
every ->evict_inode() instance should do that once and do that
before doing anything ->write_inode() could interfere with
(e.g. freeing the on-disk inode).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:49 -04:00
Al Viro c6287315cb generic_detach_inode() can be static now
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:48 -04:00
Al Viro be7ce4161f New method - evict_inode()
Hybrid of ->clear_inode() and ->delete_inode(); if present, does
all fs work to be done when in-core inode is about to be gone,
for whatever reason.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:46 -04:00
Al Viro a4ffdde6e5 simplify checks for I_CLEAR/I_FREEING
add I_CLEAR instead of replacing I_FREEING with it.  I_CLEAR is
equivalent to I_FREEING for almost all code looking at either;
it's there to keep track of having called clear_inode() exactly
once per inode lifetime, at some point after having set I_FREEING.
I_CLEAR and I_FREEING never get set at the same time with the
current code, so we can switch to setting i_flags to I_FREEING | I_CLEAR
instead of I_CLEAR without loss of information.  As the result of
such change, checks become simpler and the amount of code that needs
to know about I_CLEAR shrinks a lot.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:44 -04:00
Christoph Hellwig 2c27c65ed0 check ATTR_SIZE contraints in inode_change_ok
Make sure we check the truncate constraints early on in ->setattr by adding
those checks to inode_change_ok.  Also clean up and document inode_change_ok
to make this obvious.

As a fallout we don't have to call inode_newsize_ok from simple_setsize and
simplify it down to a truncate_setsize which doesn't return an error.  This
simplifies a lot of setattr implementations and means we use truncate_setsize
almost everywhere.  Get rid of fat_setsize now that it's trivial and mark
ext2_setsize static to make the calling convention obvious.

Keep the inode_newsize_ok in vmtruncate for now as all callers need an
audit for its removal anyway.

Note: setattr code in ecryptfs doesn't call inode_change_ok at all and
needs a deeper audit, but that is left for later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:39 -04:00
Christoph Hellwig 1025774ce4 remove inode_setattr
Replace inode_setattr with opencoded variants of it in all callers.  This
moves the remaining call to vmtruncate into the filesystem methods where it
can be replaced with the proper truncate sequence.

In a few cases it was obvious that we would never end up calling vmtruncate
so it was left out in the opencoded variant:

 spufs: explicitly checks for ATTR_SIZE earlier
 btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
 ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above

In addition to that ncpfs called inode_setattr with handcrafted iattrs,
which allowed to trim down the opencoded variant.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:37 -04:00
Christoph Hellwig 6a1a90ad1b rename generic_setattr
Despite its name it's now a generic implementation of ->setattr, but
rather a helper to copy attributes from a struct iattr to the inode.
Rename it to setattr_copy to reflect this fact.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:35 -04:00
Christoph Hellwig eafdc7d190 sort out blockdev_direct_IO variants
Move the call to vmtruncate to get rid of accessive blocks to the callers
in prepearation of the new truncate calling sequence.  This was only done
for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant
was not needed anyway.  Get rid of blockdev_direct_IO_no_locking and
its _newtrunc variant while at it as just opencoding the two additional
paramters is shorted than the name suffix.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:29 -04:00
Tejun Heo 7cc015811e bio, fs: separate out bio_types.h and define READ/WRITE constants in terms of BIO_RW_* flags
linux/fs.h hard coded READ/WRITE constants which should match BIO_RW_*
flags.  This is fragile and caused breakage during BIO_RW_* flag
rearrangement.  The hardcoding is to avoid include dependency hell.

Create linux/bio_types.h which contatins definitions for bio data
structures and flags and include it from bio.h and fs.h, and make fs.h
define all READ/WRITE related constants in terms of BIO_RW_* flags.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:53:10 +02:00
Tejun Heo aca27ba961 bio, fs: update RWA_MASK, READA and SWRITE to match the corresponding BIO_RW_* bits
Commit a82afdf (block: use the same failfast bits for bio and request)
moved BIO_RW_* bits around such that they match up with REQ_* bits.
Unfortunately, fs.h hard coded RW_MASK, RWA_MASK, READ, WRITE, READA
and SWRITE as 0, 1, 2 and 3, and expected them to match with BIO_RW_*
bits.  READ/WRITE didn't change but BIO_RW_AHEAD was moved to bit 4
instead of bit 1, breaking RWA_MASK, READA and SWRITE.

This patch updates RWA_MASK, READA and SWRITE such that they match the
BIO_RW_* bits again.  A follow up patch will update the definitions to
directly use BIO_RW_* bits so that this kind of breakage won't happen
again.

Neil also spotted missing RWA_MASK conversion.

Stable: The offending commit a82afdf was released with v2.6.32, so
this patch should be applied to all kernels since then but it must
_NOT_ be applied to kernels earlier than that.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-bisected-by: Vladislav Bolkhovitin <vst@vlnb.net>
Root-caused-by: Neil Brown <neilb@suse.de>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:53:07 +02:00
Christoph Hellwig 7b6d91daee block: unify flags for struct bio and struct request
Remove the current bio flags and reuse the request flags for the bio, too.
This allows to more easily trace the type of I/O from the filesystem
down to the block driver.  There were two flags in the bio that were
missing in the requests:  BIO_RW_UNPLUG and BIO_RW_AHEAD.  Also I've
renamed two request flags that had a superflous RW in them.

Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:20:39 +02:00
Christoph Hellwig 41f2df6289 block: BARRIER request should imply SYNC
A barrier request should by defintion have priority in get_request
and let the queue be unplugged immediately as it's blocking all forward
progress due to the queue draining.

Most filesystems already get this implicitly by the way how submit_bh
treats the buffer_ordered flag, and gfs2 sets it explicitly.  But btrfs
and XFS are still forgetting to set the flag, as is blkdev_issue_flush
and some places in DM/MD.

For XFS on metadata heavy workloads this gives a consistent speedup
in the 2-3% range.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:15:44 +02:00
Linus Torvalds 7e6880951d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (90 commits)
  AppArmor: fix build warnings for non-const use of get_task_cred
  selinux: convert the policy type_attr_map to flex_array
  AppArmor: Enable configuring and building of the AppArmor security module
  TOMOYO: Use pathname specified by policy rather than execve()
  AppArmor: update path_truncate method to latest version
  AppArmor: core policy routines
  AppArmor: policy routines for loading and unpacking policy
  AppArmor: mediation of non file objects
  AppArmor: LSM interface, and security module initialization
  AppArmor: Enable configuring and building of the AppArmor security module
  AppArmor: update Maintainer and Documentation
  AppArmor: functions for domain transitions
  AppArmor: file enforcement routines
  AppArmor: userspace interfaces
  AppArmor: dfa match engine
  AppArmor: contexts used in attaching policy to system objects
  AppArmor: basic auditing infrastructure.
  AppArmor: misc. base functions and defines
  TOMOYO: Update version to 2.3.0
  TOMOYO: Fix quota check.
  ...
2010-08-04 10:28:39 -07:00
Eric Paris 9cfcac810e vfs: re-introduce MAY_CHDIR
Currently MAY_ACCESS means that filesystems must check the permissions
right then and not rely on cached results or the results of future
operations on the object.  This can be because of a call to sys_access() or
because of a call to chdir() which needs to check search without relying on
any future operations inside that dir.  I plan to use MAY_ACCESS for other
purposes in the security system, so I split the MAY_ACCESS and the
MAY_CHDIR cases.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by:  Stephen D. Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
2010-08-02 15:35:06 +10:00
Alexey Dobriyan 6e006701cc dnotify: move dir_notify_enable declaration
Move dir_notify_enable declaration to where it belongs -- dnotify.h .

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:59:01 -04:00
Signed-off-by: Wu Fengguang 12ed2e36c9 fanotify: FMODE_NONOTIFY and __O_SYNC in sparc conflict
sparc used the same value as FMODE_NONOTIFY so change FMODE_NONOTIFY to be
something unique.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:54 -04:00