linux-sg2042

Commit Graph

Author	SHA1	Message	Date
Josef Bacik	972fbf7798	ext3: don't try to resize if there are no reserved gdt blocks left When trying to resize a ext3 fs and you run out of reserved gdt blocks, you get an error that doesn't actually tell you what went wrong, it just says that the gdb it picked is not correct, which is the case since you don't have any reserved gdt blocks left. This patch adds a check to make sure you have reserved gdt blocks to use, and if not prints out a more relevant error. Signed-off-by: Josef Bacik <jbacik@redhat.com> Cc: <linux-ext4@vger.kernel.org> Cc: Andreas Dilger <adilger@sun.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:37 -07:00
Hidehiro Kawai	885e353c74	jbd: don't dirty original metadata buffer on abort Currently, original metadata buffers are dirtied when they are unfiled whether the journal has aborted or not. Eventually these buffers will be written-back to the filesystem by pdflush. This means some metadata buffers are written to the filesystem without journaling if the journal aborts. So if both journal abort and system crash happen at the same time, the filesystem would become inconsistent state. Additionally, replaying journaled metadata can overwrite the latest metadata on the filesystem partly. Because, if the journal aborts, journaled metadata are preserved and replayed during the next mount not to lose uncheckpointed metadata. This would also break the consistency of the filesystem. This patch prevents original metadata buffers from being dirtied on abort by clearing BH_JBDDirty flag from those buffers. Thus, no metadata buffers are written to the filesystem without journaling. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Jan Kara <jack@suse.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:37 -07:00
Hidehiro Kawai	d1645e526a	jbd: abort when failed to log metadata buffers If we failed to write metadata buffers to the journal space and succeeded to write the commit record, stale data can be written back to the filesystem as metadata in the recovery phase. To avoid this, when we failed to write out metadata buffers, abort the journal before writing the commit record. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Jan Kara <jack@suse.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:36 -07:00
KOSAKI Motohiro	e575f111dc	coredump_filter: add hugepage dumping Presently hugepage's vma has a VM_RESERVED flag in order not to be swapped. But a VM_RESERVED vma isn't core dumped because this flag is often used for some kernel vmas (e.g. vmalloc, sound related). Thus hugepages are never dumped and it can't be debugged easily. Many developers want hugepages to be included into core-dump. However, We can't read generic VM_RESERVED area because this area is often IO mapping area. then these area reading may change device state. it is definitly undesiable side-effect. So adding a hugepage specific bit to the coredump filter is better. It will be able to hugepage core dumping and doesn't cause any side-effect to any i/o devices. In additional, libhugetlb use hugetlb private mapping pages as anonymous page. Then, hugepage private mapping pages should be core dumped by default. Then, /proc/[pid]/core_dump_filter has two new bits. - bit 5 mean hugetlb private mapping pages are dumped or not. (default: yes) - bit 6 mean hugetlb shared mapping pages are dumped or not. (default: no) I tested by following method. % ulimit -c unlimited % ./crash_hugepage 50 % ./crash_hugepage 50 -p % ls -lh % gdb ./crash_hugepage core % % echo 0x43 > /proc/self/coredump_filter % ./crash_hugepage 50 % ./crash_hugepage 50 -p % ls -lh % gdb ./crash_hugepage core #include <stdlib.h> #include <stdio.h> #include <unistd.h> #include <sys/mman.h> #include <string.h> #include "hugetlbfs.h" int main(int argc, char** argv){ char* p; int ch; int mmap_flags = MAP_SHARED; int fd; int nr_pages; while((ch = getopt(argc, argv, "p")) != -1) { switch (ch) { case 'p': mmap_flags &= ~MAP_SHARED; mmap_flags \|= MAP_PRIVATE; break; default: /* nothing/ break; } } argc -= optind; argv += optind; if (argc == 0){ printf("need # of pages\n"); exit(1); } nr_pages = atoi(argv[0]); if (nr_pages < 2) { printf("nr_pages must >2\n"); exit(1); } fd = hugetlbfs_unlinked_fd(); p = mmap(NULL, nr_pages gethugepagesize(), PROT_READ\|PROT_WRITE, mmap_flags, fd, 0); sleep(2); (p + gethugepagesize()) = 1; / COW / sleep(2); / crash! / (int*)0 = 1; return 0; } Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Kawai Hidehiro <hidehiro.kawai.ez@hitachi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: William Irwin <wli@holomorphy.com> Cc: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:32 -07:00
Nick Piggin	51b07fc3c5	fs: buffer lock use lock bitops trylock_buffer and unlock_buffer open and close a critical section. Hence, we can use the lock bitops to get the desired memory ordering. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:32 -07:00
Nick Piggin	5344b7e648	vmstat: mlocked pages statistics Add NR_MLOCK zone page state, which provides a (conservative) count of mlocked pages (actually, the number of mlocked pages moved off the LRU). Reworked by lts to fit in with the modified mlock page support in the Reclaim Scalability series. [kosaki.motohiro@jp.fujitsu.com: fix incorrect Mlocked field of /proc/meminfo] [lee.schermerhorn@hp.com: mlocked-pages: add event counting with statistics] Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:31 -07:00
Lee Schermerhorn	ba9ddf4939	Ramfs and Ram Disk pages are unevictable Christoph Lameter pointed out that ram disk pages also clutter the LRU lists. When vmscan finds them dirty and tries to clean them, the ram disk writeback function just redirties the page so that it goes back onto the active list. Round and round she goes... With the ram disk driver [rd.c] replaced by the newer 'brd.c', this is no longer the case, as ram disk pages are no longer maintained on the lru. [This makes them unmigratable for defrag or memory hot remove, but that can be addressed by a separate patch series.] However, the ramfs pages behave like ram disk pages used to, so: Define new address_space flag [shares address_space flags member with mapping's gfp mask] to indicate that the address space contains all unevictable pages. This will provide for efficient testing of ramfs pages in page_evictable(). Also provide wrapper functions to set/test the unevictable state to minimize #ifdefs in ramfs driver and any other users of this facility. Set the unevictable state on address_space structures for new ramfs inodes. Test the unevictable state in page_evictable() to cull unevictable pages. These changes depend on [CONFIG_]UNEVICTABLE_LRU. [riel@redhat.com: undo the brd.c part] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Debugged-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:50:26 -07:00
Lee Schermerhorn	7b854121eb	Unevictable LRU Page Statistics Report unevictable pages per zone and system wide. Kosaki Motohiro added support for memory controller unevictable statistics. [riel@redhat.com: fix printk in show_free_areas()] [akpm@linux-foundation.org: fix units in /proc/vmstats] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Debugged-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:50:26 -07:00
Rik van Riel	4f98a2fee8	vmscan: split LRU lists into anon & file sets Split the LRU lists in two, one set for pages that are backed by real file systems ("file") and one for pages that are backed by memory and swap ("anon"). The latter includes tmpfs. The advantage of doing this is that the VM will not have to scan over lots of anonymous pages (which we generally do not want to swap out), just to find the page cache pages that it should evict. This patch has the infrastructure and a basic policy to balance how much we scan the anon lists and how much we scan the file lists. The big policy changes are in separate patches. [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset] [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru] [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page] [hugh@veritas.com: memcg swapbacked pages active] [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED] [akpm@linux-foundation.org: fix /proc/vmstat units] [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration] [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo] [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()] Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:50:25 -07:00
Thomas Gleixner	c465a76af6	Merge branches 'timers/clocksource', 'timers/hrtimers', 'timers/nohz', 'timers/ntp', 'timers/posixtimers' and 'timers/debug' into v28-timers-for-linus	2008-10-20 13:14:06 +02:00
Steve French	3270958b71	[CIFS] undo changes in cifs_rename_pending_delete if it errors out The cifs_rename_pending_delete process involves multiple steps. If it fails and we're going to return error, we don't want to leave things in a half-finished state. Add code to the function to undo changes if a call fails. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-10-20 00:44:19 +00:00
Jeff Layton	9a8165fce7	cifs: track DeletePending flag in cifsInodeInfo cifs: track DeletePending flag in cifsInodeInfo The QPathInfo call returns a flag that indicates whether DELETE_ON_CLOSE is set. Track it in the cifsInodeInfo. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-10-20 00:33:52 +00:00
Geert Uytterhoeven	54779aabb0	UBIFS: fix ubifs_compress commentary Update the comment for ubifs_compress(), which incorrectly states that it returnsa success/failure indicator. Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2008-10-19 13:01:37 +03:00
Artem Bityutskiy	fae7fb299f	UBIFS: amend printk It is better to print "Reserved for root" than "Reserved pool size", because it is more obvious for users what this means. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2008-10-19 13:01:30 +03:00
Adrian Hunter	727d2dc045	UBIFS: do not read unnecessary bytes when unpacking bits Fixes the following Oops: BUG: unable to handle kernel paging request at f8d24000 IP: [<f8ff0657>] :ubifs:ubifs_unpack_bits+0xcd/0x231 pde = 34333067 pte = 00000000 Oops: 0000 [#1] PREEMPT SMP Modules linked in: deflate zlib_deflate lzo lzo_decompress lzo_compress ubifs ubi nandsim nand nand_ids nand_ecc mtd nfsd lockd sunrpc exportfs [last unloaded: nand_ecc] Pid: 7450, comm: sync Not tainted (2.6.27-rc8-ubifs-2.6 #27) EIP: 0060:[<f8ff0657>] EFLAGS: 00010206 CPU: 0 EIP is at ubifs_unpack_bits+0xcd/0x231 [ubifs] EAX: 00000000 EBX: 00000000 ECX: d7e43dc0 EDX: 0000ff00 ESI: 00000004 EDI: f8d23ffe EBP: d7e43db4 ESP: d7e43d8c DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Process sync (pid: 7450, ti=d7e42000 task=eb6f9530 task.ti=d7e42000) Stack: 00000400 c0103db4 dc5e8090 d7e43dc0 d7e43dc0 d7e43dc4 0000001c 00000004 f496d1e0 f8d23ffc d7e43dd4 f8ffac7e f8d23ffe 00000000 f8d23ffe f2b7af68 f496d1e0 f8d23ffc d7e43e2c f8ffadc5 00000000 0001f000 00000000 c03b10a7 Call Trace: [<c0103db4>] ? restore_nocheck_notrace+0x0/0xe [<f8ffac7e>] ? is_a_node+0x43/0x92 [ubifs] [<f8ffadc5>] ? dbg_check_ltab+0xf8/0x5c9 [ubifs] [<c03b10a7>] ? mutex_lock_nested+0x1b2/0x2a0 [<f8ffc86e>] ? ubifs_lpt_start_commit+0x49/0xecb [ubifs] [<c03b0ef3>] ? mutex_unlock+0xd/0xf [<f8fef017>] ? ubifs_tnc_start_commit+0x1cf/0xef8 [ubifs] [<f8fe65d8>] ? do_commit+0x18f/0x52d [ubifs] [<f8fe69f6>] ? ubifs_run_commit+0x80/0xca [ubifs] [<f8fd8d35>] ? ubifs_sync_fs+0xdb/0xf6 [ubifs] [<c0181a07>] ? sync_filesystems+0xc6/0x10c [<c019f279>] ? do_sync+0x3b/0x6a [<c019f2ba>] ? sys_sync+0x12/0x18 [<c0103ced>] ? sysenter_do_call+0x12/0x35 ======================= Code: 4d ec 89 01 8b 45 e8 89 10 89 d8 89 f1 d3 e8 85 c0 74 07 29 d6 83 fe 20 75 2a 89 d8 83 c4 1c 5b 5e 5f 5d c3 0f b6 57 01 c1 e2 08 <0f> b6 47 02 c1 e0 10 09 c2 0f b6 07 09 c2 0f b EIP: [<f8ff0657>] ubifs_unpack_bits+0xcd/0x231 [ubifs] SS:ESP 0068:d7e43d8c ---[ end trace 1bbb4c407a6dd816 ]--- Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>	2008-10-19 13:01:21 +03:00
Alexander Belyakov	5bf1723723	[JFFS2] Write buffer offset adjustment for NOR-ECC (Sibley) flash After choosing new c->nextblock, don't leave the wbuf offset field occasionally pointing at the start of the next physical eraseblock. This was causing a BUG() on NOR-ECC (Sibley) flash, where we start writing after the cleanmarker. Among other this fix should cover write buffer offset adjustment after flushing the last page of an eraseblock. Signed-off-by: Alexander Belyakov <abelyako@googlemail.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2008-10-18 11:54:09 +01:00
Linus Torvalds	58617d5e59	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: Remove automatic enabling of the HUGE_FILE feature flag ext4: Replace hackish ext4_mb_poll_new_transaction with commit callback ext4: Update Documentation/filesystems/ext4.txt ext4: Remove unused mount options: nomballoc, mballoc, nocheck ext4: Remove compile warnings when building w/o CONFIG_PROC_FS ext4: Add missing newlines to printk messages ext4: Fix file fragmentation during large file write. vfs: Add no_nrwrite_index_update writeback control flag vfs: Remove the range_cont writeback mode. ext4: Use tag dirty lookup during mpage_da_submit_io ext4: let the block device know when unused blocks can be discarded ext4: Don't reuse released data blocks until transaction commits ext4: Use an rbtree for tracking blocks freed during transaction. ext4: Do mballoc init before doing filesystem recovery ext4: Free ext4_prealloc_space using kmem_cache_free ext4: Fix Kconfig typo for ext4dev ext4: Remove an old reference to ext4dev in Makefile comment	2008-10-17 15:08:11 -07:00
Magnus Deininger	57c7b4e68e	9p: fix device file handling In v9fs_get_inode(), for block, as well as char devices (in theory), the function init_special_inode() is called to set up callback functions for file ops. this function uses the file mode's value to determine whether to use block or char dev functions. In v9fs_inode_from_fid(), the function p9mode2unixmode() is used, but for all devices it initially returns S_IFBLK, then uses v9fs_get_inode() to initialise a new inode, then finally uses v9fs_stat2inode(), which would determine whether the inode is a block or character device. However, at that point init_special_inode() had already decided to use the block device functions, so even if the inode's mode is turned to a character device, the block functions are still used to operate on them. The attached patch simply calls init_special_inode() again for devices after parsing device node data in v9fs_stat2inode() so that the proper functions are used. Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2008-10-17 12:44:46 -05:00
Andy Adamson	ec9a05c94c	NFS: use correct fs type for v4 submounts and referrals Signed-off-by: Andy Adamson<andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-17 13:06:48 -04:00
Neil Brown	504e518953	Make nfs_file_cred more robust. As not all files have an associated open_context (e.g. device special files), it is safest to test for the existence of the open context before de-referencing it. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-17 13:06:45 -04:00
Chuck Lever	18de973530	NFS: Enable NFSv4 callback server to listen on AF_INET6 sockets Allow the NFS callback server to listen for requests via an AF_INET6 or AF_INET socket when IPv6 support is present in the kernel. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-17 13:06:41 -04:00
Arjan van de Ven	651dab4264	Merge commit 'linus/master' into merge-linus Conflicts: arch/x86/kvm/i8254.c	2008-10-17 09:20:26 -07:00
Eric Van Hensbergen	02da398b95	9p: eliminate depricated conv functions Remove depricated conv functions which have been replaced with new protocol routines. This patch also reworks the one instance of the file-system code which directly calls conversion routines (to accomplish unpacking dirreads). Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2008-10-17 11:06:57 -05:00
Eric Van Hensbergen	51a87c552d	9p: rework client code to use new protocol support functions Now that the new protocol functions are in place, this patch switches the client code to using the new support code. Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2008-10-17 11:04:45 -05:00
Eric Van Hensbergen	06b55b464e	9p: move dirread to fs layer Currently reading a directory is implemented in the client code. This function is not actually a wire operation, but a meta operation which calls read operations and processes the results. This patch moves this functionality to the fs layer and calls component wire operations instead of constructing their packets. This provides a cleaner separation and will help when we reorganize the client functions and protocol processing methods. Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2008-10-17 11:04:43 -05:00
Eric Van Hensbergen	dfb0ec2e13	9p: adjust 9p vfs write operation Currently, the 9p net wire operation ensures that all data is sent by sending multiple packets if the data requested is larger than the msize. This is better handled in the vfs code so that we can simplify wire operations to being concerned with only putting data onto and taking data off of the wire. Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2008-10-17 11:04:43 -05:00
Eric Van Hensbergen	fbedadc16e	9p: move readn meta-function from client to fs layer There are a couple of methods in the client code which aren't actually wire operations. To keep things organized cleaner, these operations are being moved to the fs layer. This patch moves the readn meta-function (which executes multiple wire reads until a buffer is full) to the fs layer. Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2008-10-17 11:04:43 -05:00
Eric Van Hensbergen	0fc9655ec6	9p: consolidate read/write functions Currently there are two separate versions of read and write. One for dealing with user buffers and the other for dealing with kernel buffers. There is a tremendous amount of code duplication in the otherwise identical versions of these functions. This patch adds an additional user buffer parameter to read and write and conditionalizes handling of the buffer on whether the kernel buffer or the user buffer is populated. Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2008-10-17 11:04:42 -05:00
Eric Van Hensbergen	8b81ef589a	9p: consolidate transport structure Right now there is a transport module structure which provides per-transport type functions and data and a transport structure which contains per-instance public data as well as function pointers to instance specific functions. This patch moves public transport visible instance data to the client structure (which in some cases had duplicate data) and consolidates the functions into the transport module structure. Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2008-10-17 11:04:41 -05:00
Geert Uytterhoeven	faa5c2a15e	[JFFS2] Correct parameter names of jffs2_compress() in comments Make the parameter names of jffs2_compress() in its comments match with the actual implementation Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2008-10-17 15:56:44 +01:00
Jeff Layton	dd1db2dedc	cifs: don't use CREATE_DELETE_ON_CLOSE in cifs_rename_pending_delete cifs: don't use CREATE_DELETE_ON_CLOSE in cifs_rename_pending_delete CREATE_DELETE_ON_CLOSE apparently has different semantics than when you set the DELETE_ON_CLOSE bit after opening the file. Setting it in the open says "delete this file as soon as this filehandle is closed". That's not what we want for cifs_rename_pending_delete. Don't set this bit in the CreateFlags. Experimentation shows that setting this flag in the SET_FILE_INFO call has no effect. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-10-17 14:47:13 +00:00
Randy Dunlap	496aa8a98f	block: fix current kernel-doc warnings Fix block kernel-doc warnings: Warning(linux-2.6.27-git4//fs/block_dev.c:1272): No description found for parameter 'path' Warning(linux-2.6.27-git4//block/blk-core.c:1021): No description found for parameter 'cpu' Warning(linux-2.6.27-git4//block/blk-core.c:1021): No description found for parameter 'part' Warning(/var/linsrc/linux-2.6.27-git4//block/genhd.c:544): No description found for parameter 'partno' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-17 08:46:57 +02:00
Tejun Heo	0fc71e3d65	block: add partition attribute for partition number With extended devt, finding out the partition number becomes a bit more challenging as subtracting the minor number from that of the parent device doesn't work anymore. The only thing left is parsing the partition name which is brittle and not exactly universal (some have '-' between the device name and partition number while others don't). This patch introduced partition attribute which contains the partition number of the device. This should make finding partitions and its index easier. This problem and solution were suggested by H. Peter Anvin. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-17 08:46:56 +02:00
Theodore Ts'o	f287a1a561	ext4: Remove automatic enabling of the HUGE_FILE feature flag If the HUGE_FILE feature flag is not set, don't allow the creation of large files, instead of automatically enabling the feature flag. Recent versions of mke2fs will set the HUGE_FILE flag automatically anyway for ext4 filesystems. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-16 22:50:48 -04:00
Theodore Ts'o	3e624fc72f	ext4: Replace hackish ext4_mb_poll_new_transaction with commit callback The multiblock allocator needs to be able to release blocks (and issue a blkdev discard request) when the transaction which freed those blocks is committed. Previously this was done via a polling mechanism when blocks are allocated or freed. A much better way of doing things is to create a jbd2 callback function and attaching the list of blocks to be freed directly to the transaction structure. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-16 20:00:24 -04:00
Theodore Ts'o	01436ef2e4	ext4: Remove unused mount options: nomballoc, mballoc, nocheck These mount options don't actually do anything any more, so remove them. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-17 07:22:35 -04:00
Manish Katiyar	0b09923eab	ext4: Remove compile warnings when building w/o CONFIG_PROC_FS Signed-off-by: Manish Katiyar <mkatiyar@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-17 14:58:45 -04:00
Eric Sesterhenn	5128273a32	ext4: Add missing newlines to printk messages There are some newlines missing in ext4_check_descriptors, which cause the printk level to be printed out when the next printk call is made: [ 778.847265] EXT4-fs: ext4_check_descriptors: Block bitmap for group 0 not in group (block 1509949442)!<3>EXT4-fs: group descriptors corrupted! [ 802.646630] EXT4-fs: ext4_check_descriptors: Inode bitmap for group 0 not in group (block 9043971)!<3>EXT4-fs: group descriptors corrupted! Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-17 09:16:19 -04:00
Linus Torvalds	52ad096465	Merge git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (53 commits) NFS: Fix a resolution problem with nfs_inode->cache_change_attribute NFS: Fix the resolution problem with nfs_inode_attrs_need_update() NFS: Changes to inode->i_nlinks must set the NFS_INO_INVALID_ATTR flag RPC/RDMA: ensure connection attempt is complete before signalling. RPC/RDMA: correct the reconnect timer backoff RPC/RDMA: optionally emit useful transport info upon connect/disconnect. RPC/RDMA: reformat a debug printk to keep lines together. RPC/RDMA: harden connection logic against missing/late rdma_cm upcalls. RPC/RDMA: fix connect/reconnect resource leak. RPC/RDMA: return a consistent error, when connect fails. RPC/RDMA: adhere to protocol for unpadded client trailing write chunks. RPC/RDMA: avoid an oops due to disconnect racing with async upcalls. RPC/RDMA: maintain the RPC task bytes-sent statistic. RPC/RDMA: suppress retransmit on RPC/RDMA clients. RPC/RDMA: fix connection IRD/ORD setting RPC/RDMA: support FRMR client memory registration. RPC/RDMA: check selected memory registration mode at runtime. RPC/RDMA: add data types and new FRMR memory registration enum. RPC/RDMA: refactor the inline memory registration code. NFS: fix nfs_parse_ip_address() corner case ...	2008-10-16 15:39:20 -07:00
Linus Torvalds	c813b4e16e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (46 commits) UIO: Fix mapping of logical and virtual memory UIO: add automata sercos3 pci card support UIO: Change driver name of uio_pdrv UIO: Add alignment warnings for uio-mem Driver core: add bus_sort_breadthfirst() function NET: convert the phy_device file to use bus_find_device_by_name kobject: Cleanup kobject_rename and !CONFIG_SYSFS kobject: Fix kobject_rename and !CONFIG_SYSFS sysfs: Make dir and name args to sysfs_notify() const platform: add new device registration helper sysfs: use ilookup5() instead of ilookup5_nowait() PNP: create device attributes via default device attributes Driver core: make bus_find_device_by_name() more robust usb: turn dev_warn+WARN_ON combos into dev_WARN debug: use dev_WARN() rather than WARN_ON() in device_pm_add() debug: Introduce a dev_WARN() function sysfs: fix deadlock device model: Do a quickcheck for driver binding before doing an expensive check Driver core: Fix cleanup in device_create_vargs(). Driver core: Clarify device cleanup. ...	2008-10-16 12:40:26 -07:00
Linus Torvalds	c8d8a2321f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: module: remove CONFIG_KMOD in comment after #endif remove CONFIG_KMOD from fs remove CONFIG_KMOD from drivers Manually fix conflict due to include cleanups in drivers/md/md.c	2008-10-16 12:38:34 -07:00
Linus Torvalds	e4856a70cf	Merge branch 'personality' of git://git390.osdl.marist.edu/pub/scm/linux-2.6 * 'personality' of git://git390.osdl.marist.edu/pub/scm/linux-2.6: [PATCH] remove unused ibcs2/PER_SVR4 in SET_PERSONALITY	2008-10-16 12:32:52 -07:00
Jeff Layton	469ee614aa	[CIFS] eliminate usage of kthread_stop for cifsd When cifs_demultiplex_thread was converted to a kthread based kernel thread, great pains were taken to make it so that kthread_stop would be used to bring it down. This just added unnecessary complexity since we needed to use a signal anyway to break out of kernel_recvmsg. Also, cifs_demultiplex_thread does a bit of cleanup as it's exiting, and we need to be certain that this gets done. It's possible for a kthread to exit before its main function is ever run if kthread_stop is called soon after its creation. While I'm not sure that this is a real problem with cifsd now, it could be at some point in the future if cifs_mount is ever changed to bring down the thread quickly. The upshot here is that using kthread_stop to bring down the thread just adds extra complexity with no real benefit. This patch changes the code to use the original method to bring down the thread, but still leaves it so that the thread is actually started with kthread_run. This seems to fix the deadlock caused by the reproducer in this bug report: https://bugzilla.samba.org/show_bug.cgi?id=5720 Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-10-16 18:46:39 +00:00
Steve French	2c1b861539	[CIFS] Add nodfs mount option Older samba server (eg. 3.0.24 from Debian etch) don't work correctly, if DFS paths are used. Such server claim that they support DFS, but fail to process some requests with DFS paths. Starting with Linux 2.6.26, the cifs clients starts sending DFS paths in such situations, rendering it unuseable with older samba servers. The nodfs mount options forces a share to be used with non DFS paths, even if the server claims, that it supports it. Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at> Acked-by: Jeff Layton <jlayton@redhat.com> Acked-by: Igor Mammedov <niallain@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-10-16 18:35:21 +00:00
Thomas Petazzoni	ebf3f09c63	Configure out AIO support This patchs adds the CONFIG_AIO option which allows to remove support for asynchronous I/O operations, that are not necessarly used by applications, particularly on embedded devices. As this is a size-reduction option, it depends on CONFIG_EMBEDDED. It allows to save ~7 kilobytes of kernel code/data: text data bss dec hex filename 1115067 119180 217088 1451335 162547 vmlinux 1108025 119048 217088 1444161 160941 vmlinux.new -7042 -132 0 -7174 -1C06 +/- This patch has been originally written by Matt Mackall <mpm@selenic.com>, and is part of the Linux Tiny project. [randy.dunlap@oracle.com: build fix] Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Zach Brown <zach.brown@oracle.com> Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:51 -07:00
Nick Piggin	15b4650e55	afs: convert to new aops Cannot assume writes will fully complete, so this conversion goes the easy way and always brings the page uptodate before the write. [dhowells@redhat.com: style tweaks] Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:48 -07:00
Oleg Nesterov	07edbde508	pid_ns: de_thread: kill the now unneeded ->child_reaper change de_thread() checks if the old leader was the ->child_reaper, this is not possible any longer. With the previous patch ->group_leader itself will change ->child_reaper on exit. Henceforth find_new_reaper() is the only function (apart from initialization) which plays with ->child_reaper. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:47 -07:00
Alexey Dobriyan	f40cbaa5b0	proc: move sysrq-trigger out of fs/proc/ Move it into sysrq.c, along with the rest of the sysrq implementation. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:47 -07:00
Kay Sievers	ac0d86f580	block: sanitize invalid partition table entries We currently follow blindly what the partition table lies about the disk, and let the kernel create block devices which can not be accessed. Trying to identify the device leads to kernel logs full of: sdb: rw=0, want=73392, limit=28800 attempt to access beyond end of device Here is an example of a broken partition table, where sda2 starts behind the end of the disk, and sdb3 is larger than the entire disk: Disk /dev/sdb: 14 MB, 14745600 bytes 1 heads, 29 sectors/track, 993 cylinders, total 28800 sectors Device Boot Start End Blocks Id System /dev/sdb1 29 7800 3886 83 Linux /dev/sdb2 37801 45601 3900+ 83 Linux /dev/sdb3 15602 73402 28900+ 83 Linux /dev/sdb4 23403 28796 2697 83 Linux The kernel creates these completely invalid devices, which can not be accessed, or may lead to other unpredictable failures: grep . /sys/class/block/sdb/{start,size} /sys/class/block/sdb/size:28800 /sys/class/block/sdb1/start:29 /sys/class/block/sdb1/size:7772 /sys/class/block/sdb2/start:37801 /sys/class/block/sdb2/size:7801 /sys/class/block/sdb3/start:15602 /sys/class/block/sdb3/size:57801 /sys/class/block/sdb4/start:23403 /sys/class/block/sdb4/size:5394 With this patch, we ignore partitions which start behind the end of the disk, and limit partitions to the end of the disk if they pretend to be larger: grep . /sys/class/block/sdb/{start,size} /sys/class/block/sdb/size:28800 /sys/class/block/sdb1/start:29 /sys/class/block/sdb1/size:7772 /sys/class/block/sdb3/start:15602 /sys/class/block/sdb3/size:13198 /sys/class/block/sdb4/start:23403 /sys/class/block/sdb4/size:5394 These warnings are printed to the kernel log: sdb: p2 ignored, start 37801 is behind the end of the disk sdb: p3 size 57801 limited to end of disk Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Cc: Herton Ronaldo Krzesinski <herton@mandriva.com.br> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:47 -07:00
Adrian Bunk	6722e45c2d	fs/partitions/acorn.c: remove dead code I missed this when I did the arm26 removal. Reported-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:47 -07:00
Alexey Dobriyan	4cea5ceb4c	COMPAT_BINFMT_ELF definition tweak Don't repeat BINFMT_ELF definition, simply multiply COMPAT and BINFMT_ELF. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:47 -07:00
Paul Mundt	5edc2a5123	binfmt_elf_fdpic: wire up AT_EXECFD, AT_EXECFN, AT_SECURE These auxvec entries are the only ones left unhandled out of the current base implementation. This syncs up binfmt_elf_fdpic with linux/auxvec.h and current binfmt_elf. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:46 -07:00
Paul Mundt	c7637941d1	binfmt_elf_fdpic: convert initial stack alignment to arch_align_stack() binfmt_elf_fdpic seems to have grabbed a hard-coded hack from an ancient version of binfmt_elf in order to try and fix up initial stack alignment on multi-threaded x86, which while in addition to being unused, was also pushed down beyond the first set of operations on the stack pointer, negating the entire purpose. These days, we have an architecture independent arch_align_stack(), so we switch to using that instead. Move the initial alignment up before the initial stores while we're at it. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:46 -07:00
Paul Mundt	ec23847d6c	binfmt_elf_fdpic: support auxvec base platform string Commit `483fad1c3f` ("ELF loader support for auxvec base platform string") introduced AT_BASE_PLATFORM, but only implemented it for binfmt_elf. Given that AT_VECTOR_SIZE_BASE is unconditionally enlarged for us, and it's only optionally added in for the platforms that set ELF_BASE_PLATFORM, wire it up for binfmt_elf_fdpic, too. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:46 -07:00
Adrian Bunk	b73c29f6b0	quota: remove CVS keywords Remove CVS keywords that weren't updated for a long time from comments. Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:46 -07:00
Julien Brunel	67b172c097	fs/reiserfs: use an IS_ERR test rather than a NULL test In case of error, the function open_xa_dir returns an ERR pointer, but never returns a NULL pointer. So a NULL test that comes after an IS_ERR test should be deleted. The semantic match that finds this problem is as follows: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @match_bad_null_test@ expression x, E; statement S1,S2; @@ x = open_xa_dir(...) ... when != x = E ( * if (x == NULL && ...) S1 else S2 \| * if (x == NULL \|\| ...) S1 else S2 ) // </smpl> Signed-off-by: Julien Brunel <brunel@diku.dk> Signed-off-by: Julia Lawall <julia@diku.dk> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Jan Kara <jack@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:46 -07:00
Adrian Bunk	6b23ea7679	reiserfs/procfs.c: remove CVS keywords Remove CVS keywords that weren't updated for a long time from comments. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:46 -07:00
Eric Sesterhenn	d38b7aa7fc	hfs: fix namelength memory corruption Fix a stack corruption caused by a corrupted hfs filesystem. If the catalog name length is corrupted the memcpy overwrites the catalog btree structure. Since the field is limited to HFS_NAMELEN bytes in the structure and the file format, we throw an error if it is too long. Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:46 -07:00
Eric Sesterhenn	649f1ee6c7	hfsplus: check read_mapping_page() return value While testing more corrupted images with hfsplus, i came across one which triggered the following bug: [15840.675016] BUG: unable to handle kernel paging request at fffffffb [15840.675016] IP: [<c0116a4f>] kmap+0x15/0x56 [15840.675016] pde = 00008067 pte = 00000000 [15840.675016] Oops: 0000 [#1] PREEMPT DEBUG_PAGEALLOC [15840.675016] Modules linked in: [15840.675016] [15840.675016] Pid: 11575, comm: ln Not tainted (2.6.27-rc4-00123-gd3ee1b4-dirty #29) [15840.675016] EIP: 0060:[<c0116a4f>] EFLAGS: 00010202 CPU: 0 [15840.675016] EIP is at kmap+0x15/0x56 [15840.675016] EAX: 00000246 EBX: fffffffb ECX: 00000000 EDX: cab919c0 [15840.675016] ESI: 000007dd EDI: cab0bcf4 EBP: cab0bc98 ESP: cab0bc94 [15840.675016] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 [15840.675016] Process ln (pid: 11575, ti=cab0b000 task=cab919c0 task.ti=cab0b000) [15840.675016] Stack: 00000000 cab0bcdc c0231cfb 00000000 cab0bce0 00000800 ca9290c0 fffffffb [15840.675016] cab145d0 cab919c0 cab15998 22222222 22222222 22222222 00000001 cab15960 [15840.675016] 000007dd cab0bcf4 cab0bd04 c022cb3a cab0bcf4 cab15a6c ca9290c0 00000000 [15840.675016] Call Trace: [15840.675016] [<c0231cfb>] ? hfsplus_block_allocate+0x6f/0x2d3 [15840.675016] [<c022cb3a>] ? hfsplus_file_extend+0xc4/0x1db [15840.675016] [<c022ce41>] ? hfsplus_get_block+0x8c/0x19d [15840.675016] [<c06adde4>] ? sub_preempt_count+0x9d/0xab [15840.675016] [<c019ece6>] ? __block_prepare_write+0x147/0x311 [15840.675016] [<c0161934>] ? __grab_cache_page+0x52/0x73 [15840.675016] [<c019ef4f>] ? block_write_begin+0x79/0xd5 [15840.675016] [<c022cdb5>] ? hfsplus_get_block+0x0/0x19d [15840.675016] [<c019f22a>] ? cont_write_begin+0x27f/0x2af [15840.675016] [<c022cdb5>] ? hfsplus_get_block+0x0/0x19d [15840.675016] [<c0139ebe>] ? tick_program_event+0x28/0x4c [15840.675016] [<c013bd35>] ? trace_hardirqs_off+0xb/0xd [15840.675016] [<c022b723>] ? hfsplus_write_begin+0x2d/0x32 [15840.675016] [<c022cdb5>] ? hfsplus_get_block+0x0/0x19d [15840.675016] [<c0161988>] ? pagecache_write_begin+0x33/0x107 [15840.675016] [<c01879e5>] ? __page_symlink+0x3c/0xae [15840.675016] [<c019ad34>] ? __mark_inode_dirty+0x12f/0x137 [15840.675016] [<c0187a70>] ? page_symlink+0x19/0x1e [15840.675016] [<c022e6eb>] ? hfsplus_symlink+0x41/0xa6 [15840.675016] [<c01886a9>] ? vfs_symlink+0x99/0x101 [15840.675016] [<c018a2f6>] ? sys_symlinkat+0x6b/0xad [15840.675016] [<c018a348>] ? sys_symlink+0x10/0x12 [15840.675016] [<c01038bd>] ? sysenter_do_call+0x12/0x31 [15840.675016] ======================= [15840.675016] Code: 00 00 75 10 83 3d 88 2f ec c0 02 75 07 89 d0 e8 12 56 05 00 5d c3 55 ba 06 00 00 00 89 e5 53 89 c3 b8 3d eb 7e c0 e8 16 74 00 00 <8b> 03 c1 e8 1e 69 c0 d8 02 00 00 05 b8 69 8e c0 2b 80 c4 02 00 [15840.675016] EIP: [<c0116a4f>] kmap+0x15/0x56 SS:ESP 0068:cab0bc94 [15840.675016] ---[ end trace 4fea40dad6b70e5f ]--- This happens because the return value of read_mapping_page() is passed on to kmap unchecked. The bug is triggered after the first read_mapping_page() in hfsplus_block_allocate(), this patch fixes all three usages in this functions but leaves the ones further down in the file unchanged. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:46 -07:00
Eric Sesterhenn	efc7ffcb42	hfsplus: fix Buffer overflow with a corrupted image When an hfsplus image gets corrupted it might happen that the catalog namelength field gets b0rked. If we mount such an image the memcpy() in hfsplus_cat_build_key_uni() writes more than the 255 that fit in the name field. Depending on the size of the overwritten data, we either only get memory corruption or also trigger an oops like this: [ 221.628020] BUG: unable to handle kernel paging request at c82b0000 [ 221.629066] IP: [<c022d4b1>] hfsplus_find_cat+0x10d/0x151 [ 221.629066] pde = 0ea29163 pte = 082b0160 [ 221.629066] Oops: 0002 [#1] PREEMPT DEBUG_PAGEALLOC [ 221.629066] Modules linked in: [ 221.629066] [ 221.629066] Pid: 4845, comm: mount Not tainted (2.6.27-rc4-00123-gd3ee1b4-dirty #28) [ 221.629066] EIP: 0060:[<c022d4b1>] EFLAGS: 00010206 CPU: 0 [ 221.629066] EIP is at hfsplus_find_cat+0x10d/0x151 [ 221.629066] EAX: 00000029 EBX: 00016210 ECX: 000042c2 EDX: 00000002 [ 221.629066] ESI: c82d70ca EDI: c82b0000 EBP: c82d1bcc ESP: c82d199c [ 221.629066] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 [ 221.629066] Process mount (pid: 4845, ti=c82d1000 task=c8224060 task.ti=c82d1000) [ 221.629066] Stack: c080b3c4 c82aa8f8 c82d19c2 00016210 c080b3be c82d1bd4 c82aa8f0 00000300 [ 221.629066] 01000000 750008b1 74006e00 74006900 65006c00 c82d6400 c013bd35 c8224060 [ 221.629066] 00000036 00000046 c82d19f0 00000082 c8224548 c8224060 00000036 c0d653cc [ 221.629066] Call Trace: [ 221.629066] [<c013bd35>] ? trace_hardirqs_off+0xb/0xd [ 221.629066] [<c013bca3>] ? trace_hardirqs_off_caller+0x14/0x9b [ 221.629066] [<c013bd35>] ? trace_hardirqs_off+0xb/0xd [ 221.629066] [<c013bca3>] ? trace_hardirqs_off_caller+0x14/0x9b [ 221.629066] [<c013bd35>] ? trace_hardirqs_off+0xb/0xd [ 221.629066] [<c0107aa3>] ? native_sched_clock+0x82/0x96 [ 221.629066] [<c01302d2>] ? __kernel_text_address+0x1b/0x27 [ 221.629066] [<c010487a>] ? dump_trace+0xca/0xd6 [ 221.629066] [<c0109e32>] ? save_stack_address+0x0/0x2c [ 221.629066] [<c0109eaf>] ? save_stack_trace+0x1c/0x3a [ 221.629066] [<c013b571>] ? save_trace+0x37/0x8d [ 221.629066] [<c013b62e>] ? add_lock_to_list+0x67/0x8d [ 221.629066] [<c013ea1c>] ? validate_chain+0x8a4/0x9f4 [ 221.629066] [<c013553d>] ? down+0xc/0x2f [ 221.629066] [<c013f1f6>] ? __lock_acquire+0x68a/0x6e0 [ 221.629066] [<c013bd35>] ? trace_hardirqs_off+0xb/0xd [ 221.629066] [<c013bca3>] ? trace_hardirqs_off_caller+0x14/0x9b [ 221.629066] [<c013bd35>] ? trace_hardirqs_off+0xb/0xd [ 221.629066] [<c0107aa3>] ? native_sched_clock+0x82/0x96 [ 221.629066] [<c013da5d>] ? mark_held_locks+0x43/0x5a [ 221.629066] [<c013dc3a>] ? trace_hardirqs_on+0xb/0xd [ 221.629066] [<c013dbf4>] ? trace_hardirqs_on_caller+0xf4/0x12f [ 221.629066] [<c06abec8>] ? _spin_unlock_irqrestore+0x42/0x58 [ 221.629066] [<c013555c>] ? down+0x2b/0x2f [ 221.629066] [<c022aa68>] ? hfsplus_iget+0xa0/0x154 [ 221.629066] [<c022b0b9>] ? hfsplus_fill_super+0x280/0x447 [ 221.629066] [<c0107aa3>] ? native_sched_clock+0x82/0x96 [ 221.629066] [<c013bca3>] ? trace_hardirqs_off_caller+0x14/0x9b [ 221.629066] [<c013bca3>] ? trace_hardirqs_off_caller+0x14/0x9b [ 221.629066] [<c013f1f6>] ? __lock_acquire+0x68a/0x6e0 [ 221.629066] [<c041c9e4>] ? string+0x2b/0x74 [ 221.629066] [<c041cd16>] ? vsnprintf+0x2e9/0x512 [ 221.629066] [<c010487a>] ? dump_trace+0xca/0xd6 [ 221.629066] [<c0109eaf>] ? save_stack_trace+0x1c/0x3a [ 221.629066] [<c0109eaf>] ? save_stack_trace+0x1c/0x3a [ 221.629066] [<c013b571>] ? save_trace+0x37/0x8d [ 221.629066] [<c013b62e>] ? add_lock_to_list+0x67/0x8d [ 221.629066] [<c013ea1c>] ? validate_chain+0x8a4/0x9f4 [ 221.629066] [<c01354d3>] ? up+0xc/0x2f [ 221.629066] [<c013f1f6>] ? __lock_acquire+0x68a/0x6e0 [ 221.629066] [<c013bd35>] ? trace_hardirqs_off+0xb/0xd [ 221.629066] [<c013bca3>] ? trace_hardirqs_off_caller+0x14/0x9b [ 221.629066] [<c013bd35>] ? trace_hardirqs_off+0xb/0xd [ 221.629066] [<c0107aa3>] ? native_sched_clock+0x82/0x96 [ 221.629066] [<c041cfb7>] ? snprintf+0x1b/0x1d [ 221.629066] [<c01ba466>] ? disk_name+0x25/0x67 [ 221.629066] [<c0183960>] ? get_sb_bdev+0xcd/0x10b [ 221.629066] [<c016ad92>] ? kstrdup+0x2a/0x4c [ 221.629066] [<c022a7b3>] ? hfsplus_get_sb+0x13/0x15 [ 221.629066] [<c022ae39>] ? hfsplus_fill_super+0x0/0x447 [ 221.629066] [<c0183583>] ? vfs_kern_mount+0x3b/0x76 [ 221.629066] [<c0183602>] ? do_kern_mount+0x32/0xba [ 221.629066] [<c01960d4>] ? do_new_mount+0x46/0x74 [ 221.629066] [<c0196277>] ? do_mount+0x175/0x193 [ 221.629066] [<c013dbf4>] ? trace_hardirqs_on_caller+0xf4/0x12f [ 221.629066] [<c01663b2>] ? __get_free_pages+0x1e/0x24 [ 221.629066] [<c06ac07b>] ? lock_kernel+0x19/0x8c [ 221.629066] [<c01962e6>] ? sys_mount+0x51/0x9b [ 221.629066] [<c01962f9>] ? sys_mount+0x64/0x9b [ 221.629066] [<c01038bd>] ? sysenter_do_call+0x12/0x31 [ 221.629066] ======================= [ 221.629066] Code: 89 c2 c1 e2 08 c1 e8 08 09 c2 8b 85 e8 fd ff ff 66 89 50 06 89 c7 53 83 c7 08 56 57 68 c4 b3 80 c0 e8 8c 5c ef ff 89 d9 c1 e9 02 <f3> a5 89 d9 83 e1 03 74 02 f3 a4 83 c3 06 8b 95 e8 fd ff ff 0f [ 221.629066] EIP: [<c022d4b1>] hfsplus_find_cat+0x10d/0x151 SS:ESP 0068:c82d199c [ 221.629066] ---[ end trace e417a1d67f0d0066 ]--- Since hfsplus_cat_build_key_uni() returns void and only has one callsite, the check is performed at the callsite. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:46 -07:00
Mike Crowe	81a73719d1	hfsplus: quieten down mounting hfsplus journaled fs read only Check whether the file system was to be mounted read only anyway before warning about changing the mount to read only. Signed-off-by: Mike Crowe <mac@mcrowe.com> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:46 -07:00
Harvey Harrison	152b95a1ed	befs: annotate fs32 on tests for superblock endianness Does compile-time byteswapping rather than runtime. Noticed by sparse: fs/befs/super.c:29:6: warning: cast to restricted __le32 fs/befs/super.c:29:6: warning: cast from restricted fs32 fs/befs/super.c:31:11: warning: cast to restricted __be32 fs/befs/super.c:31:11: warning: cast from restricted fs32 fs/befs/super.c:31:11: warning: cast to restricted __be32 fs/befs/super.c:31:11: warning: cast from restricted fs32 fs/befs/super.c:31:11: warning: cast to restricted __be32 fs/befs/super.c:31:11: warning: cast from restricted fs32 fs/befs/super.c:31:11: warning: cast to restricted __be32 fs/befs/super.c:31:11: warning: cast from restricted fs32 fs/befs/super.c:31:11: warning: cast to restricted __be32 fs/befs/super.c:31:11: warning: cast from restricted fs32 fs/befs/super.c:31:11: warning: cast to restricted __be32 fs/befs/super.c:31:11: warning: cast from restricted fs32 fs/befs/linuxvfs.c:811:7: warning: cast to restricted __le32 fs/befs/linuxvfs.c:811:7: warning: cast from restricted fs32 fs/befs/linuxvfs.c:812:7: warning: cast to restricted __be32 fs/befs/linuxvfs.c:812:7: warning: cast from restricted fs32 fs/befs/linuxvfs.c:812:7: warning: cast to restricted __be32 fs/befs/linuxvfs.c:812:7: warning: cast from restricted fs32 fs/befs/linuxvfs.c:812:7: warning: cast to restricted __be32 fs/befs/linuxvfs.c:812:7: warning: cast from restricted fs32 fs/befs/linuxvfs.c:812:7: warning: cast to restricted __be32 fs/befs/linuxvfs.c:812:7: warning: cast from restricted fs32 fs/befs/linuxvfs.c:812:7: warning: cast to restricted __be32 fs/befs/linuxvfs.c:812:7: warning: cast from restricted fs32 fs/befs/linuxvfs.c:812:7: warning: cast to restricted __be32 fs/befs/linuxvfs.c:812:7: warning: cast from restricted fs32 Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Cc: "Sergey S. Kostyliov" <rathamahata@php4.ru> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:46 -07:00
Eric Sandeen	bd39597cbd	ext2: avoid printk floods in the face of directory corruption A very large directory with many read failures (either due to storage problems, or due to invalid size & blocks from corruption) will generate a printk storm as the filesystem continues to try to read all the blocks. This flood of messages can tie up the box until it is complete - which may be a very long time, especially for very large corrupted values. This is fixed by only reporting the corruption once each time we try to read the directory. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Eugene Teo <eugeneteo@kernel.sg> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:46 -07:00
Mingming Cao	d707d31c97	ext2: fix ext2 block reservation early ENOSPC issue We could run into ENOSPC error on ext2, even when there is free blocks on the filesystem. The problem is triggered in the case the goal block group has 0 free blocks , and the rest block groups are skipped due to the check of "free_blocks < windowsz/2". Current code could fall back to non reservation allocation to prevent early ENOSPC after examing all the block groups with reservation on , but this code was bypassed if the reservation window is turned off already, which is true in this case. This patch fixed two issues: 1) We don't need to turn off block reservation if the goal block group has 0 free blocks left and continue search for the rest of block groups. Current code the intention is to turn off the block reservation if the goal allocation group has a few (some) free blocks left (not enough for make the desired reservation window),to try to allocation in the goal block group, to get better locality. But if the goal blocks have 0 free blocks, it should leave the block reservation on, and continues search for the next block groups,rather than turn off block reservation completely. 2) we don't need to check the window size if the block reservation is off. The problem was originally found and fixed in ext4. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:45 -07:00
Ian Kent	8d7b48e0bc	autofs4: add miscellaneous device for ioctls Add a miscellaneous device to the autofs4 module for routing ioctls. This provides the ability to obtain an ioctl file handle for an autofs mount point that is possibly covered by another mount. The actual problem with autofs is that it can't reconnect to existing mounts. Immediately one things of just adding the ability to remount autofs file systems would solve it, but alas, that can't work. This is because autofs direct mounts and the implementation of "on demand mount and expire" of nested mount trees have the file system mounted on top of the mount trigger dentry. To resolve this a miscellaneous device node for routing ioctl commands to these mount points has been implemented in the autofs4 kernel module and a library added to autofs. This provides the ability to open a file descriptor for these over mounted autofs mount points. Please refer to Documentation/filesystems/autofs4-mount-control.txt for a discussion of the problem, implementation alternatives considered and a description of the interface. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: build fix] Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:39 -07:00
Ian Kent	c0f54d3e54	autofs4: track uid and gid of last mount requester Track the uid and gid of the last process to request a mount for on an autofs dentry. [akpm@linux-foundation.org: fix tpyo in comment] Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:39 -07:00
Ian Kent	bb979d7fc3	autofs4: cleanup autofs mount type usage Usage of the AUTOFS_TYPE_* defines is a little confusing and appears inconsistent. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:39 -07:00
Tyler Hicks	624ae52845	eCryptfs: remove netlink transport The netlink transport code has not worked for a while and the miscdev transport is a simpler solution. This patch removes the netlink code and makes the miscdev transport the only eCryptfs kernel to userspace transport. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Dustin Kirkland <kirkland@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:39 -07:00
Badari Pulavarty	807b7ebe41	ecryptfs: convert to use new aops Convert ecryptfs to use write_begin/write_end Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Acked-by: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:39 -07:00
Michael Halcrow	7d6c704558	eCryptfs: remove retry loop in ecryptfs_readdir() The retry block in ecryptfs_readdir() has been in the eCryptfs code base for a while, apparently for no good reason. This loop could potentially run without terminating. This patch removes the loop, instead erroring out if vfs_readdir() on the lower file fails. Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Reported-by: Al Viro <viro@ZinIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:38 -07:00
Kirill A. Shutemov	bf2a9a3963	Allow recursion in binfmt_script and binfmt_misc binfmt_script and binfmt_misc disallow recursion to avoid stack overflow using sh_bang and misc_bang. It causes problem in some cases: $ echo '#!/bin/ls' > /tmp/t0 $ echo '#!/tmp/t0' > /tmp/t1 $ echo '#!/tmp/t1' > /tmp/t2 $ chmod +x /tmp/t* $ /tmp/t2 zsh: exec format error: /tmp/t2 Similar problem with binfmt_misc. This patch introduces field 'recursion_depth' into struct linux_binprm to track recursion level in binfmt_misc and binfmt_script. If recursion level more then BINPRM_MAX_RECURSION it generates -ENOEXEC. [akpm@linux-foundation.org: make linux_binprm.recursion_depth a uint] Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:38 -07:00
Kirill A. Shutemov	53112488be	alpha: introduce field 'taso' into struct linux_binprm This change is Alpha-specific. It adds field 'taso' into struct linux_binprm to remember if the application is TASO. Previously, field sh_bang was used for this purpose. Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:38 -07:00
Adrian Bunk	cde162c2a9	binfmt_som.c: add MODULE_LICENSE Add the missing MODULE_LICENSE("GPL"). Reported-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Adrian Bunk <bunk@kernel.org> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Grant Grundler <grundler@parisc-linux.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:38 -07:00
Christoph Hellwig	f7a5000f7a	compat: move cp_compat_stat to common code struct stat / compat_stat is the same on all architectures, so cp_compat_stat should be, too. Turns out it is, except that various architectures have slightly and some high2lowuid/high2lowgid or the direct assignment instead of the SET_UID/SET_GID that expands to the correct one anyway. This patch replaces the arch-specific cp_compat_stat implementations with a common one based on the x86-64 one. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: David S. Miller <davem@davemloft.net> [ sparc bits ] Acked-by: Kyle McMartin <kyle@mcmartin.ca> [ parisc bits ] Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:33 -07:00
Francois Cami	e1f8e87449	Remove Andrew Morton's old email accounts People can use the real name an an index into MAINTAINERS to find the current email address. Signed-off-by: Francois Cami <francois.cami@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:32 -07:00
Davide Libenzi	f337b9c583	epoll: drop unnecessary test Thomas found that there is an unnecessary (always true) test in ep_send_events(). The callback never inserts into ->rdllink while the send loop is performed, and also does the ~EP_PRIVATE_BITS test. Given we're holding the mutex during this time, the conditions tested inside the loop are always true. This patch drops the test done inside the re-insertion loop. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:32 -07:00
Jason Baron	362e6663ef	exec.c, compat.c: fix count(), compat_count() bounds checking With MAX_ARG_STRINGS set to 0x7FFFFFFF, and being passed to 'count()' and compat_count(), it would appear that the current max bounds check of fs/exec.c:394: if(++i > max) return -E2BIG; would never trigger. Since 'i' is of type int, so values would wrap and the function would continue looping. Simple fix seems to be chaning ++i to i++ and checking for '>='. Signed-off-by: Jason Baron <jbaron@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Ollie Wild" <aaw@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:32 -07:00
Volodymyr G. Lukiianyk	f4cfb18d79	uclinux: fix gzip header parsing in binfmt_flat.c There are off-by-one errors in decompress_exec() when calculating the length of optional "original file name" and "comment" fields: the "ret" index is not incremented when terminating '\0' character is reached. The check of the buffer overflow (after an "extra-field" length was taken into account) is also fixed. I've encountered this off-by-one error when tried to reuse gzip-header-parsing part of the decompress_exec() function. There was an "original file name" field in the payload (with miscalculated length) and zlib_inflate() returned Z_DATA_ERROR. But after the fix similar to this one all worked fine. Signed-off-by: Volodymyr G Lukiianyk <volodymyrgl@gmail.com> Acked-by: Greg Ungerer <gerg@snapgear.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:29 -07:00
Eric W. Biederman	0b4a4fea25	kobject: Cleanup kobject_rename and !CONFIG_SYSFS It finally dawned on me what the clean fix to sysfs_rename_dir calling kobject_set_name is. Move the work into kobject_rename where it belongs. The callers serialize us anyway so this is safe. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-10-16 09:24:52 -07:00
Trent Piepho	8c0e3998f5	sysfs: Make dir and name args to sysfs_notify() const Because they can be, and because code like this produces a warning if they're not: struct device_attribute dev_attr; sysfs_notify(&kobj, NULL, dev_attr.attr.name); Signed-off-by: Trent Piepho <tpiepho@freescale.com> CC: Neil Brown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-10-16 09:24:51 -07:00
Tejun Heo	45c076c5d7	sysfs: use ilookup5() instead of ilookup5_nowait() As inode creation is protected by sysfs_mutex, ilookup5_nowait() always either fails to find at all or finds one which is fully initialized, so using ilookup5_nowait() or ilookup5() doesn't make any difference. Switch to ilookup5() as it's planned to be removed. This change also makes lookup return value handling a bit simpler. This change was suggested by Al Viro. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Al Viro <viro@hera.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-10-16 09:24:51 -07:00
Nick Piggin	b31ca3f5df	sysfs: fix deadlock On Thu, Sep 11, 2008 at 10:27:10AM +0200, Ingo Molnar wrote: > and it's working fine on most boxes. One testbox found this new locking > scenario: > > PM: Adding info for No Bus:vcsa7 > EDAC DEBUG: MC0: i82860_check() > > ======================================================= > [ INFO: possible circular locking dependency detected ] > 2.6.27-rc6-tip #1 > ------------------------------------------------------- > X/4873 is trying to acquire lock: > (&bb->mutex){--..}, at: [<c020ba20>] mmap+0x40/0xa0 > > but task is already holding lock: > (&mm->mmap_sem){----}, at: [<c0125a1e>] sys_mmap2+0x8e/0xc0 > > which lock already depends on the new lock. > > > the existing dependency chain (in reverse order) is: > > -> #1 (&mm->mmap_sem){----}: > [<c017dc96>] validate_chain+0xa96/0xf50 > [<c017ef2b>] __lock_acquire+0x2cb/0x5b0 > [<c017f299>] lock_acquire+0x89/0xc0 > [<c01aa8fb>] might_fault+0x6b/0x90 > [<c040b618>] copy_to_user+0x38/0x60 > [<c020bcfb>] read+0xfb/0x170 > [<c01c09a5>] vfs_read+0x95/0x110 > [<c01c1443>] sys_pread64+0x63/0x80 > [<c012146f>] sysenter_do_call+0x12/0x43 > [<ffffffff>] 0xffffffff > > -> #0 (&bb->mutex){--..}: > [<c017d8b7>] validate_chain+0x6b7/0xf50 > [<c017ef2b>] __lock_acquire+0x2cb/0x5b0 > [<c017f299>] lock_acquire+0x89/0xc0 > [<c0d6f2ab>] __mutex_lock_common+0xab/0x3c0 > [<c0d6f698>] mutex_lock_nested+0x38/0x50 > [<c020ba20>] mmap+0x40/0xa0 > [<c01b111e>] mmap_region+0x14e/0x450 > [<c01b170f>] do_mmap_pgoff+0x2ef/0x310 > [<c0125a3d>] sys_mmap2+0xad/0xc0 > [<c012146f>] sysenter_do_call+0x12/0x43 > [<ffffffff>] 0xffffffff > > other info that might help us debug this: > > 1 lock held by X/4873: > #0: (&mm->mmap_sem){----}, at: [<c0125a1e>] sys_mmap2+0x8e/0xc0 > > stack backtrace: > Pid: 4873, comm: X Not tainted 2.6.27-rc6-tip #1 > [<c017cd09>] print_circular_bug_tail+0x79/0xc0 > [<c017d8b7>] validate_chain+0x6b7/0xf50 > [<c017a5b5>] ? trace_hardirqs_off_caller+0x15/0xb0 > [<c017ef2b>] __lock_acquire+0x2cb/0x5b0 > [<c017f299>] lock_acquire+0x89/0xc0 > [<c020ba20>] ? mmap+0x40/0xa0 > [<c0d6f2ab>] __mutex_lock_common+0xab/0x3c0 > [<c020ba20>] ? mmap+0x40/0xa0 > [<c0d6f698>] mutex_lock_nested+0x38/0x50 > [<c020ba20>] ? mmap+0x40/0xa0 > [<c020ba20>] mmap+0x40/0xa0 > [<c01b111e>] mmap_region+0x14e/0x450 > [<c01afb88>] ? arch_get_unmapped_area_topdown+0xf8/0x160 > [<c01b170f>] do_mmap_pgoff+0x2ef/0x310 > [<c0125a3d>] sys_mmap2+0xad/0xc0 > [<c012146f>] sysenter_do_call+0x12/0x43 > [<c0120000>] ? __switch_to+0x130/0x220 > ======================= > evbug.c: Event. Dev: input3, Type: 20, Code: 0, Value: 500 > warning: `sudo' uses deprecated v2 capabilities in a way that may be insecure. > > i've attached the config. > > at first sight it looks like a genuine bug in fs/sysfs/bin.c? Yes, it is a real bug by the looks. bin.c takes bb->mutex under mmap_sem when it is mmapped, and then does its copy_*_user under bb->mutex too. Here is a basic fix for the sysfs lor. From: Nick Piggin <npiggin@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-10-16 09:24:50 -07:00
Neil Brown	f1282c844e	sysfs: Support sysfs_notify from atomic context with new sysfs_notify_dirent Support sysfs_notify from atomic context with new sysfs_notify_dirent sysfs_notify currently takes sysfs_mutex. This means that it cannot be called in atomic context. sysfs_mutex is sometimes held over a malloc (sysfs_rename_dir) so it can block on low memory. In md I want to be able to notify on a sysfs attribute from atomic context, and I don't want to block on low memory because I could be in the writeout path for freeing memory. So: - export the "sysfs_dirent" structure along with sysfs_get, sysfs_put and sysfs_get_dirent so I can get the sysfs_dirent that I want to notify on and hold it in an md structure. - split sysfs_notify_dirent out of sysfs_notify so the sysfs_dirent can be notified on with no blocking (just a spinlock). Signed-off-by: Neil Brown <neilb@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-10-16 09:24:47 -07:00
Greg Kroah-Hartman	a9b12619f7	device create: misc: convert device_create_drvdata to device_create Now that device_create() has been audited, rename things back to the original call to be sane. Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-10-16 09:24:43 -07:00
Andrew Morton	ae87221d3c	sysfs: crash debugging Print the name of the last-accessed sysfs file when we oops, to help track down oopses which occur in sysfs store/read handlers. Because these oopses tend to not leave any trace of the offending code in the stack traces. Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-10-16 09:24:41 -07:00
Johannes Berg	5f4123be3c	remove CONFIG_KMOD from fs Just always compile the code when the kernel is modular. Convert load_nls to use try_then_request_module to tidy up the code. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-10-17 02:38:36 +11:00
Thomas Gleixner	2be3b52a57	proc: fixup irq iterator There is no need for irq_desc here. Even for sparse_irq we can handle this clever in for_each_irq_nr(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-16 16:53:30 +02:00
Thomas Gleixner	2cc21ef843	genirq: remove sparse irq code This code is not ready, but we need to rip it out instead of rebasing as we would lose the APIC/IO_APIC unification otherwise. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-16 16:53:15 +02:00
Yinghai Lu	6d50bc2683	x86: use 28 bits irq NR for pci msi/msix and ht also print out irq no in /proc/interrups and /proc/stat in hex, so could tell bus/dev/func. Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:52 +02:00
Yinghai Lu	52b17329d6	x86_64: make /proc/interrupts work with dyn irq_desc loop with irq_desc list Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:51 +02:00
Yinghai Lu	c7fb03a475	irq, fs/proc: replace loop with nr_irqs for proc/stat Replace another nr_irqs loop to avoid the allocation of all sparse irq entries - use for_each_irq_desc instead. v2: make sure arch without GENERIC_HARDIRQS works too Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:33 +02:00
Yinghai Lu	7f95ec9e4c	x86: move kstat_irqs from kstat to irq_desc based on Eric's patch ... together mold it with dyn_array for irq_desc, will allcate kstat_irqs for nr_irq_desc alltogether if needed. -- at that point nr_cpus is known already. v2: make sure system without generic_hardirqs works they don't have irq_desc v3: fix merging v4: [mingo@elte.hu] fix typo [ mingo@elte.hu ] irq: build fix fix: arch/x86/xen/spinlock.c: In function 'xen_spin_lock_slow': arch/x86/xen/spinlock.c:90: error: 'struct kernel_stat' has no member named 'irqs' Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:32 +02:00
Yinghai Lu	da27c118eb	fs/proc: use nr_irqs Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:07 +02:00
Aneesh Kumar K.V	22208dedbd	ext4: Fix file fragmentation during large file write. The range_cyclic writeback mode uses the address_space writeback_index as the start index for writeback. With delayed allocation we were updating writeback_index wrongly resulting in highly fragmented file. This patch reduces the number of extents reduced from 4000 to 27 for a 3GB file. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-16 10:10:36 -04:00
Tejun Heo	a7c1b990f7	fuse: implement nonseekable open Let the client request nonseekable open using FOPEN_NONSEEKABLE and call nonseekable_open() on the file if requested. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2008-10-16 16:08:57 +02:00
Tejun Heo	29d434b39c	fuse: add include protectors Add include protectors to include/linux/fuse.h and fs/fuse/fuse_i.h. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2008-10-16 16:08:57 +02:00
Robert P. J. Day	37194d0723	fuse: config description improvement Make the short description of the FUSE_FS config option clearer. Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2008-10-16 16:08:57 +02:00
Julia Lawall	17e18ab6ff	fuse: add missing fuse_request_free The error handling code for the second call to fuse_request_alloc should include freeing the result of the first one. This bug was found by the Coccinelle project: http://www.emn.fr/x-info/coccinelle/ Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2008-10-16 16:08:56 +02:00
Miklos Szeredi	769415c611	fuse: fix SEEK_END incorrectness Update file size before using it in lseek(..., SEEK_END). Reported-by: Amnon Shiloh <u3557@miso.sublimeip.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2008-10-16 16:08:56 +02:00
Martin Schwidefsky	0b59268285	[PATCH] remove unused ibcs2/PER_SVR4 in SET_PERSONALITY The SET_PERSONALITY macro is always called with a second argument of 0. Remove the ibcs argument and the various tests to set the PER_SVR4 personality. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>	2008-10-16 15:40:05 +02:00
Trond Myklebust	6925bac120	Merge branch 'next'	2008-10-15 15:54:56 -04:00
Christoph Hellwig	6c5e51dae2	xfs: fix remount rw with unrecognized options When we skip unrecognized options in xfs_fs_remount we should just break out of the switch and not return because otherwise we may skip clearing the xfs-internal read-only flag. This will only show up on some operations like touch because most read-only checks are done by the VFS which thinks this filesystem is r/w. Eventually we should replace the XFS read-only flag with a helper that always checks the VFS flag to make sure they can never get out of sync. Bug reported and fix verified by Marcel Beister on #xfs. Bug fix verified by updated xfstests/189. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Timothy Shimmin <tes@sgi.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-15 10:00:00 -07:00
Mark Fasheh	1efd47f873	ocfs2: fix build error I merged the latest ocfs2_read_blocks() changes in xattr.c wrong. This makes Ocfs2 compile again. Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-14 18:31:46 -07:00
Linus Torvalds	acd15a8360	Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: (56 commits) ocfs2: Make cached block reads the common case. ocfs2: Kill the last naked wait_on_buffer() for cached reads. ocfs2: Move ocfs2_bread() into dir.c ocfs2: Simplify ocfs2_read_block() ocfs2: Require an inode for ocfs2_read_block(s)(). ocfs2: Separate out sync reads from ocfs2_read_blocks() ocfs2: Refactor xattr list and remove ocfs2_xattr_handler(). ocfs2: Calculate EA hash only by its suffix. ocfs2: Move trusted and user attribute support into xattr.c ocfs2: Uninline ocfs2_xattr_name_hash() ocfs2: Don't check for NULL before brelse() ocfs2: use smaller counters in ocfs2_remove_xattr_clusters_from_cache ocfs2: Documentation update for user_xattr / nouser_xattr mount options ocfs2: make la_debug_mutex static ocfs2: Remove pointless !! ocfs2: Add empty bucket support in xattr. ocfs2/xattr.c: Fix a bug when inserting xattr. ocfs2: Add xattr mount option in ocfs2_show_options() ocfs2: Switch over to JBD2. ocfs2: Add the 'inode64' mount option. ...	2008-10-14 16:34:11 -07:00
Trond Myklebust	011935a0a7	NFS: Fix a resolution problem with nfs_inode->cache_change_attribute The cache_change_attribute is used to decide whether or not a directory has changed, in which case we may need to look it up again. Again, the use of 'jiffies' leads to an issue of resolution. Once again, the fix is to change nfs_inode->cache_change_attribute, and just make it a simple counter. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-14 19:24:50 -04:00
Trond Myklebust	4704f0e274	NFS: Fix the resolution problem with nfs_inode_attrs_need_update() It appears that 'jiffies' timestamps do not have high enough resolution for nfs_inode_attrs_need_update(). One problem is that a GETATTR can be launched within < 1 jiffy of the last operation that updated the attribute. Another problem is that RPC calls can take < 1 jiffy to execute. We can fix this by switching the variables to use a simple global counter that gets incremented every time we start another GETATTR call. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-14 19:23:17 -04:00
Trond Myklebust	921615f111	NFS: Changes to inode->i_nlinks must set the NFS_INO_INVALID_ATTR flag Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-14 19:23:07 -04:00
Linus Torvalds	8acd3a60bc	Merge branch 'for-2.6.28' of git://linux-nfs.org/~bfields/linux * 'for-2.6.28' of git://linux-nfs.org/~bfields/linux: (59 commits) svcrdma: Fix IRD/ORD polarity svcrdma: Update svc_rdma_send_error to use DMA LKEY svcrdma: Modify the RPC reply path to use FRMR when available svcrdma: Modify the RPC recv path to use FRMR when available svcrdma: Add support to svc_rdma_send to handle chained WR svcrdma: Modify post recv path to use local dma key svcrdma: Add a service to register a Fast Reg MR with the device svcrdma: Query device for Fast Reg support during connection setup svcrdma: Add FRMR get/put services NLM: Remove unused argument from svc_addsock() function NLM: Remove "proto" argument from lockd_up() NLM: Always start both UDP and TCP listeners lockd: Remove unused fields in the nlm_reboot structure lockd: Add helper to sanity check incoming NOTIFY requests lockd: change nlmclnt_grant() to take a "struct sockaddr *" lockd: Adjust nlmsvc_lookup_host() to accomodate AF_INET6 addresses lockd: Adjust nlmclnt_lookup_host() signature to accomodate non-AF_INET lockd: Support non-AF_INET addresses in nlm_lookup_host() NLM: Convert nlm_lookup_host() to use a single argument svcrdma: Add Fast Reg MR Data Types ...	2008-10-14 12:31:14 -07:00
Joel Becker	d4a8c93c82	ocfs2: Make cached block reads the common case. ocfs2_read_blocks() currently requires the CACHED flag for cached I/O. However, that's the common case. Let's flip it around and provide an IGNORE_CACHE flag for the special users. This has the added benefit of cleaning up the code some (ignore_cache takes on its special meaning earlier in the loop). Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-14 11:58:22 -07:00
Joel Becker	5e0b3dec01	ocfs2: Kill the last naked wait_on_buffer() for cached reads. ocfs2's cached buffer I/O goes through ocfs2_read_block(s)(). dir.c had a naked wait_on_buffer() to wait for some readahead, but it should use ocfs2_read_block() instead. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-14 11:58:11 -07:00
Joel Becker	07446dc72c	ocfs2: Move ocfs2_bread() into dir.c dir.c is the only place using ocfs2_bread(), so let's make it static to that file. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-14 11:58:03 -07:00
Joel Becker	0fcaa56a2a	ocfs2: Simplify ocfs2_read_block() More than 30 callers of ocfs2_read_block() pass exactly OCFS2_BH_CACHED. Only six pass a different flag set. Rather than have every caller care, let's make ocfs2_read_block() take no flags and always do a cached read. The remaining six places can call ocfs2_read_blocks() directly. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-14 11:51:57 -07:00
Joel Becker	31d33073ca	ocfs2: Require an inode for ocfs2_read_block(s)(). Now that synchronous readers are using ocfs2_read_blocks_sync(), all callers of ocfs2_read_blocks() are passing an inode. Use it unconditionally. Since it's there, we don't need to pass the ocfs2_super either. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-14 11:43:29 -07:00
Joel Becker	da1e90985a	ocfs2: Separate out sync reads from ocfs2_read_blocks() The ocfs2_read_blocks() function currently handles sync reads, cached, reads, and sometimes cached reads. We're going to add some functionality to it, so first we should simplify it. The uncached, synchronous reads are much easer to handle as a separate function, so we instroduce ocfs2_read_blocks_sync(). Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-14 11:29:10 -07:00
Aneesh Kumar K.V	af6f029d38	ext4: Use tag dirty lookup during mpage_da_submit_io This enables us to drop the range_cont writeback mode use from ext4_da_writepages. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2008-10-14 09:20:19 -04:00
Theodore Ts'o	8a0aba733d	ext4: let the block device know when unused blocks can be discarded Let the block device know when unused blocks can be discarded, using the new sb_issue_discard() interface. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-16 10:06:27 -04:00
Tao Ma	936b883436	ocfs2: Refactor xattr list and remove ocfs2_xattr_handler(). According to Christoph Hellwig's advice, we really don't need a ->list to handle one xattr's list. Just a map from index to xattr prefix is enough. And I also refactor the old list method with the reference from fs/xfs/linux-2.6/xfs_xattr.c and the xattr list method in btrfs. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 17:02:45 -07:00
Tao Ma	2057e5c678	ocfs2: Calculate EA hash only by its suffix. According to Christoph Hellwig's advice, the hash value of EA is only calculated by its suffix. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 17:02:44 -07:00
Mark Fasheh	99219aea68	ocfs2: Move trusted and user attribute support into xattr.c Per Christoph Hellwig's suggestion - don't split these up. It's not like we gained much by having the two tiny files around. Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 17:02:44 -07:00
Mark Fasheh	40daa16a34	ocfs2: Uninline ocfs2_xattr_name_hash() This is too big to be inlined. Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 17:02:44 -07:00
Mark Fasheh	a81cb88b64	ocfs2: Don't check for NULL before brelse() This is pointless as brelse() already does the check. Signed-off-by: Mark Fasheh	2008-10-13 17:02:44 -07:00
Mark Fasheh	fd8351f83d	ocfs2: use smaller counters in ocfs2_remove_xattr_clusters_from_cache i and b_len don't really need to be u64's. Xattr extent lengths should be limited by the VFS, and then the size of our on-disk length field. Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 17:02:44 -07:00
Mark Fasheh	4cc8124584	ocfs2: make la_debug_mutex static It can also be moved into ocfs2_la_debug_read(). Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 17:02:44 -07:00
Mark Fasheh	009d37502a	ocfs2: Remove pointless !! ocfs2_stack_supports_plocks() doesn't need this to properly return a zero or one value. Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 17:02:44 -07:00
Tao Ma	5a09561199	ocfs2: Add empty bucket support in xattr. As Mark mentioned, it may be time-consuming when we remove the empty xattr bucket, so this patch try to let empty bucket exist in xattr operation. The modification includes: 1. Remove the functin of bucket and extent record deletion during xattr delete. 2. In xattr set: 1) Don't clean the last entry so that if the bucket is empty, the hash value of the bucket is the hash value of the entry which is deleted last. 2) During insert, if we meet with an empty bucket, just use the 1st entry. 3. In binary search of xattr bucket, use the bucket hash value(which stored in the 1st xattr entry) to find the right place. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 17:02:43 -07:00
Tao Ma	06b240d8af	ocfs2/xattr.c: Fix a bug when inserting xattr. During the process of xatt insertion, we use binary search to find the right place and "low" is set to it. But when there is one xattr which has the same name hash as the inserted one, low is the wrong value. So set it to the right position. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 17:02:43 -07:00
Sunil Mushran	b0f73cfc36	ocfs2: Add xattr mount option in ocfs2_show_options() Patch adds check for [no]user_xattr in ocfs2_show_options() that completes the list of all mount options. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 17:02:43 -07:00
Joel Becker	2b4e30fbde	ocfs2: Switch over to JBD2. ocfs2 wants JBD2 for many reasons, not the least of which is that JBD is limiting our maximum filesystem size. It's a pretty trivial change. Most functions are just renamed. The only functional change is moving to Jan's inode-based ordered data mode. It's better, too. Because JBD2 reads and writes JBD journals, this is compatible with any existing filesystem. It can even interact with JBD-based ocfs2 as long as the journal is formated for JBD. We provide a compatibility option so that paranoid people can still use JBD for the time being. This will go away shortly. [ Moved call of ocfs2_begin_ordered_truncate() from ocfs2_delete_inode() to ocfs2_truncate_for_delete(). --Mark ] Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 17:02:43 -07:00
Joel Becker	12462f1d9f	ocfs2: Add the 'inode64' mount option. Now that ocfs2 limits inode numbers to 32bits, add a mount option to disable the limit. This parallels XFS. 64bit systems can handle the larger inode numbers. [ Added description of inode64 mount option in ocfs2.txt. --Mark ] Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:08 -07:00
Joel Becker	1187c96885	ocfs2: Limit inode allocation to 32bits. ocfs2 inode numbers are block numbers. For any filesystem with less than 2^32 blocks, this is not a problem. However, when ocfs2 starts using JDB2, it will be able to support filesystems with more than 2^32 blocks. This would result in inode numbers higher than 2^32. The problem is that stat(2) can't handle those numbers on 32bit machines. The simple solution is to have ocfs2 allocate all inodes below that boundary. The suballoc code is changed to honor an optional block limit. Only the inode suballocator sets that limit - all other allocations stay unlimited. The biggest trick is to grow the inode suballocator beneath that limit. There's no point in allocating block groups that are above the limit, then rejecting their elements later on. We want to prevent the inode allocator from ever having block groups above the limit. This involves a little gyration with the local alloc code. If the local alloc window is above the limit, it signals the caller to try the global bitmap but does not disable the local alloc file (which can be used for other allocations). [ Minor cleanup - removed an ML_NOTICE comment. --Mark ] Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:07 -07:00
Tao Ma	08413899db	ocfs2: Resolve deadlock in ocfs2_xattr_free_block. In ocfs2_xattr_free_block, we take a cluster lock on xb_alloc_inode while we have a transaction open. This will deadlock the downconvert thread, so fix it. We can clean up how xattr blocks are removed while here - this patch also moves the mechanism of releasing xattr block (including both value, xattr tree and xattr block) into this function. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:06 -07:00
Tao Ma	28b8ca0b7f	ocfs2: bug-fix for journal extend in xattr. In ocfs2_extend_trans, when we can't extend the current transaction, it will commit current transaction and restart a new one. So if the previous credits we have allocated aren't used(the block isn't dirtied before our extend), we will not have enough credits for any future operation(it will cause jbd complain and bug out). So check this and re-extend it. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:06 -07:00
Joel Becker	8d6220d6a7	ocfs2: Change ocfs2_get__extent_tree() to ocfs2_init__extent_tree() The original get/put_extent_tree() functions held a reference on et_root_bh. However, every single caller already has a safe reference, making the get/put cycle irrelevant. We change ocfs2_get__extent_tree() to ocfs2_init__extent_tree(). It no longer gets a reference on et_root_bh. ocfs2_put_extent_tree() is removed. Callers now have a simpler init+use pattern. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:05 -07:00
Joel Becker	1625f8ac15	ocfs2: Comment struct ocfs2_extent_tree_operations. struct ocfs2_extent_tree_operations provides methods for the different on-disk btrees in ocfs2. Describing what those methods do is probably a good idea. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:05 -07:00
Joel Becker	f99b9b7ccf	ocfs2: Make ocfs2_extent_tree the first-class representation of a tree. We now have three different kinds of extent trees in ocfs2: inode data (dinode), extended attributes (xattr_tree), and extended attribute values (xattr_value). There is a nice abstraction for them, ocfs2_extent_tree, but it is hidden in alloc.c. All the calling functions have to pick amongst a varied API and pass in type bits and often extraneous pointers. A better way is to make ocfs2_extent_tree a first-class object. Everyone converts their object to an ocfs2_extent_tree() via the ocfs2_get_*_extent_tree() calls, then uses the ocfs2_extent_tree for all tree calls to alloc.c. This simplifies a lot of callers, making for readability. It also provides an easy way to add additional extent tree types, as they only need to be defined in alloc.c with a ocfs2_get_<new>_extent_tree() function. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:05 -07:00
Joel Becker	1e61ee79e2	ocfs2: Add an insertion check to ocfs2_extent_tree_operations. A couple places check an extent_tree for a valid inode. We move that out to add an eo_insert_check() operation. It can be called from ocfs2_insert_extent() and elsewhere. We also have the wrapper calls ocfs2_et_insert_check() and ocfs2_et_sanity_check() ignore NULL ops. That way we don't have to provide useless operations for xattr types. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:05 -07:00
Joel Becker	1a09f556e5	ocfs2: Create specific get_extent_tree functions. A caller knows what kind of extent tree they have. There's no reason they have to call ocfs2_get_extent_tree() with a NULL when they could just as easily call a specific function to their type of extent tree. Introduce ocfs2_dinode_get_extent_tree(), ocfs2_xattr_tree_get_extent_tree(), and ocfs2_xattr_value_get_extent_tree(). They only take the necessary arguments, calling into the underlying __ocfs2_get_extent_tree() to do the real work. __ocfs2_get_extent_tree() is the old ocfs2_get_extent_tree(), but without needing any switch-by-type logic. ocfs2_get_extent_tree() is now a wrapper around the specific calls. It exists because a couple alloc.c functions can take et_type. This will go later. Another benefit is that ocfs2_xattr_value_get_extent_tree() can take a struct ocfs2_xattr_value_root* instead of void*. This gives us typechecking where we didn't have it before. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:05 -07:00
Joel Becker	943cced39e	ocfs2: Determine an extent tree's max_leaf_clusters in an et_op. Provide an optional extent_tree_operation to specify the max_leaf_clusters of an ocfs2_extent_tree. If not provided, the value is 0 (unlimited). Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:04 -07:00
Joel Becker	1c25d93a4a	ocfs2: Use struct ocfs2_extent_tree in ocfs2_num_free_extents(). ocfs2_num_free_extents() re-implements the logic of ocfs2_get_extent_tree(). Now that ocfs2_get_extent_tree() does not allocate, let's use it in ocfs2_num_free_extents() to simplify the code. The inode validation code in ocfs2_num_free_extents() is not needed. All callers are passing in pre-validated inodes. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:04 -07:00
Joel Becker	0ce1010f1a	ocfs2: Provide the get_root_el() method to ocfs2_extent_tree_operations. The root_el of an ocfs2_extent_tree needs to be calculated from et->et_object. Make it an operation on et->et_ops. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:04 -07:00
Joel Becker	ea5efa1512	ocfs2: Make 'private' into 'object' on ocfs2_extent_tree. The 'private' pointer was a way to store off xattr values, which don't live at a set place in the bh. But the concept of "the object containing the extent tree" is much more generic. For an inode it's the struct ocfs2_dinode, for an xattr value its the value. Let's save off the 'object' at all times. If NULL is passed to ocfs2_get_extent_tree(), 'object' is set to bh->b_data; Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:04 -07:00
Joel Becker	dc0ce61af4	ocfs2: Make ocfs2_extent_tree get/put instead of alloc. Rather than allocating a struct ocfs2_extent_tree, just put it on the stack. Fill it with ocfs2_get_extent_tree() and drop it with ocfs2_put_extent_tree(). Now the callers don't have to ENOMEM, yet still safely ref the root_bh. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:04 -07:00
Joel Becker	ce1d9ea621	ocfs2: Prefix the ocfs2_extent_tree structure. The members of the ocfs2_extent_tree structure gain a prefix of 'et_'. All users are updated. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:04 -07:00
Joel Becker	35dc0aa3c5	ocfs2: Prefix the extent tree operations structure. The ocfs2_extent_tree_operations structure gains a field prefix on its members. The ->eo_sanity_check() operation gains a wrapper function for completeness. All of the extent tree operation wrappers gain a consistent name (ocfs2_et_*()). Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:04 -07:00
Mark Fasheh	ff1ec20ef6	ocfs2: fix printk format warnings This patch fixes the following build warnings: fs/ocfs2/xattr.c: In function 'ocfs2_half_xattr_bucket': fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 7 has type 'long int' fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 8 has type 'long int' fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 7 has type 'long int' fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 8 has type 'long int' fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 7 has type 'long int' fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 8 has type 'long int' fs/ocfs2/xattr.c: In function 'ocfs2_xattr_set_entry_in_bucket': fs/ocfs2/xattr.c:4092: warning: format '%d' expects type 'int', but argument 6 has type 'size_t' fs/ocfs2/xattr.c:4092: warning: format '%d' expects type 'int', but argument 6 has type 'size_t' fs/ocfs2/xattr.c:4092: warning: format '%d' expects type 'int', but argument 6 has type 'size_t' Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:03 -07:00
Tiger Yang	8154da3d21	ocfs2: Add incompatible flag for extended attribute This patch adds the s_incompat flag for extended attribute support. This helps us ensure that older versions of Ocfs2 or ocfs2-tools will not be able to mount a volume with xattr support. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:03 -07:00
Tao Ma	a394425643	ocfs2: Delete all xattr buckets during inode removal In inode removal, we need to iterate all the buckets, remove any externally-stored EA values and delete the xattr buckets. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:03 -07:00
Tao Ma	012255961c	ocfs2: Enable xattr set in index btree Where the previous patches added the ability of list/get xattr in buckets for ocfs2, this patch enables ocfs2 to store large numbers of EAs. The original design doc is written by Mark Fasheh, and it can be found in http://oss.oracle.com/osswiki/OCFS2/DesignDocs/IndexedEATrees. I only had to make small modifications to it. First, because the bucket size is 4K, a new field named xh_free_start is added in ocfs2_xattr_header to indicate the next valid name/value offset in a bucket. It is used when we store new EA name/value. With this field, we can find the place more quickly and what's more, we don't need to sort the name/value every time to let the last entry indicate the next unused space. This makes the insert operation more efficient for blocksizes smaller than 4k. Because of the new xh_free_start, another field named as xh_name_value_len is also added in ocfs2_xattr_header. It records the total length of all the name/values in the bucket. We need this so that we can check it and defragment the bucket if there is not enough contiguous free space. An xattr insertion looks like this: 1. xattr_index_block_find: find the right bucket by the name_hash, say bucketA. 2. check whether there is enough space in bucketA. If yes, insert it directly and modify xh_free_start and xh_name_value_len accordingly. If not, check xh_name_value_len to see whether we can store this by defragment the bucket. If yes, defragment it and go on insertion. 3. If defragement doesn't work, check whether there is new empty bucket in the clusters within this extent record. If yes, init the new bucket and move all the buckets after bucketA one by one to the next bucket. Move half of the entries in bucketA to the next bucket and go on insertion. 4. If there is no new bucket, grow the extent tree. As for xattr deletion, we will delete an xattr bucket when all it's xattrs are removed and move all the buckets after it to the previous one. When all the xattr buckets in an extend record are freed, free this extend records from ocfs2_xattr_tree. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:03 -07:00
Tao Ma	ca12b7c489	ocfs2: Optionally limit extent size in ocfs2_insert_extent() In xattr bucket, we want to limit the maximum size of a btree leaf, otherwise we'll lose the benefits of hashing because we'll have to search large leaves. So add a new field in ocfs2_extent_tree which indicates the maximum leaf cluster size we want so that we can prevent ocfs2_insert_extent() from merging the leaf record even if it is contiguous with an adjacent record. Other btree types are not affected by this change. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:03 -07:00
Tao Ma	589dc2602f	ocfs2: Add xattr lookup code xattr btrees Add code to lookup a given extended attribute in the xattr btree. Lookup follows this general scheme: 1. Use ocfs2_xattr_get_rec to find the xattr extent record 2. Find the xattr bucket within the extent which may contain this xattr 3. Iterate the bucket to find the xattr. In ocfs2_xattr_block_get(), we need to recalcuate the block offset and name offset for the right position of name/value. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:03 -07:00
Tao Ma	0c044f0b24	ocfs2: Add xattr bucket iteration for large numbers of EAs Ocfs2 breaks up xattr index tree leaves into 4k regions, called buckets. Attributes are stored within a given bucket, depending on hash value. After a discussion with Mark, we decided that the per-bucket index (xe_entry[]) would only exist in the 1st block of a bucket. Likewise, name/value pairs will not straddle more than one block. This allows the majority of operations to work directly on the buffer heads in a leaf block. This patch adds code to iterate the buckets in an EA. A new abstration of ocfs2_xattr_bucket is added. It records the bhs in this bucket and ocfs2_xattr_header. This keeps the code neat, improving readibility. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:03 -07:00
Tao Ma	ba492615f0	ocfs2: Add xattr index tree operations When necessary, an ocfs2_xattr_block will embed an ocfs2_extent_list to store large numbers of EAs. This patch adds a new type in ocfs2_extent_tree_type and adds the implementation so that we can re-use the b-tree code to handle the storage of many EAs. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:02 -07:00
Tiger Yang	cf1d6c763f	ocfs2: Add extended attribute support This patch implements storing extended attributes both in inode or a single external block. We only store EA's in-inode when blocksize > 512 or that inode block has free space for it. When an EA's value is larger than 80 bytes, we will store the value via b-tree outside inode or block. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:02 -07:00
Tiger Yang	fdd77704a8	ocfs2: reserve inline space for extended attribute Add the structures and helper functions we want for handling inline extended attributes. We also update the inline-data handlers so that they properly function in the event that we have both inline data and inline attributes sharing an inode block. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:02 -07:00
Tao Ma	f56654c435	ocfs2: Add extent tree operation for xattr value btrees Add some thin wrappers around ocfs2_insert_extent() for each of the 3 different btree types, ocfs2_inode_insert_extent(), ocfs2_xattr_value_insert_extent() and ocfs2_xattr_tree_insert_extent(). The last is for the xattr index btree, which will be used in a followup patch. All the old callers in file.c etc will call ocfs2_dinode_insert_extent(), while the other two handle the xattr issue. And the init of extent tree are handled by these functions. When storing xattr value which is too large, we will allocate some clusters for it and here ocfs2_extent_list and ocfs2_extent_rec will also be used. In order to re-use the b-tree operation code, a new parameter named "private" is added into ocfs2_extent_tree and it is used to indicate the root of ocfs2_exent_list. The reason is that we can't deduce the root from the buffer_head now. It may be in an inode, an ocfs2_xattr_block or even worse, in any place in an ocfs2_xattr_bucket. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:01 -07:00
Tao Ma	ac11c82719	ocfs2: Add helper function in uptodate.c for removing xattr clusters The old uptodate only handles the issue of removing one buffer_head from ocfs2 inode's buffer cache. With xattr clusters, we may need to remove multiple buffer_head's at a time. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 13:57:59 -07:00
Tao Ma	5a7bc8eb29	ocfs2: Add the basic xattr disk layout in ocfs2_fs.h Ocfs2 uses a very flexible structure for storing extended attributes on disk. Small amount of attributes are stored directly in the inode block - up to 256 bytes worth. If that fills up, attributes are also stored in an external block, linked to from the inode block. That block can in turn expand to a btree, capable of storing large numbers of attributes. Individual attribute values are stored inline if they're small enough (currently about 80 bytes, this can be changed though), and otherwise are expanded to a btree. The theoretical limit to the size of an individual attribute is about the same as an inode, though the kernel's upper bound on the size of an attributes data is far smaller. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 13:57:59 -07:00
Tao Ma	0eb8d47e69	ocfs2: Make high level btree extend code generic Factor out the non-inode specifics of ocfs2_do_extend_allocation() into a more generic function, ocfs2_do_cluster_allocation(). ocfs2_do_extend_allocation calls ocfs2_do_cluster_allocation() now, but the latter can be used for other btree types as well. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 13:57:59 -07:00
Tao Ma	e7d4cb6bc1	ocfs2: Abstract ocfs2_extent_tree in b-tree operations. In the old extent tree operation, we take the hypothesis that we are using the ocfs2_extent_list in ocfs2_dinode as the tree root. As xattr will also use ocfs2_extent_list to store large value for a xattr entry, we refactor the tree operation so that xattr can use it directly. The refactoring includes 4 steps: 1. Abstract set/get of last_eb_blk and update_clusters since they may be stored in different location for dinode and xattr. 2. Add a new structure named ocfs2_extent_tree to indicate the extent tree the operation will work on. 3. Remove all the use of fe_bh and di, use root_bh and root_el in extent tree instead. So now all the fe_bh is replaced with et->root_bh, el with root_el accordingly. 4. Make ocfs2_lock_allocators generic. Now it is limited to be only used in file extend allocation. But the whole function is useful when we want to store large EAs. Note: This patch doesn't touch ocfs2_commit_truncate() since it is not used for anything other than truncate inode data btrees. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 13:57:58 -07:00
Tao Ma	811f933df1	ocfs2: Use ocfs2_extent_list instead of ocfs2_dinode. ocfs2_extend_meta_needed(), ocfs2_calc_extend_credits() and ocfs2_reserve_new_metadata() are all useful for extent tree operations. But they are all limited to an inode btree because they use a struct ocfs2_dinode parameter. Change their parameter to struct ocfs2_extent_list (the part of an ocfs2_dinode they actually use) so that the xattr btree code can use these functions. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 13:57:58 -07:00
Tao Ma	231b87d109	ocfs2: Modify ocfs2_num_free_extents for future xattr usage. ocfs2_num_free_extents() is used to find the number of free extent records in an inode btree. Hence, it takes an "ocfs2_dinode" parameter. We want to use this for extended attribute trees in the future, so genericize the interface the take a buffer head. A future patch will allow that buffer_head to contain any structure rooting an ocfs2 btree. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 13:57:58 -07:00
Mark Fasheh	9a8ff578fb	ocfs2: track local alloc state via debugfs A per-mount debugfs file, "local_alloc" is created which when read will expose live state of the nodes local alloc file. Performance impact is minimal, only a bit of memory overhead per mount point. Still, the code is hidden behind CONFIG_OCFS2_FS_STATS. This feature will help us debug local alloc performance problems on a live system. Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 13:57:58 -07:00
Mark Fasheh	9c7af40b21	ocfs2: throttle back local alloc when low on disk space Ocfs2's local allocator disables itself for the duration of a mount point when it has trouble allocating a large enough area from the primary bitmap. That can cause performance problems, especially for disks which were only temporarily full or fragmented. This patch allows for the allocator to shrink it's window first, before being disabled. Later, it can also be re-enabled so that any performance drop is minimized. To do this, we allow the value of osb->local_alloc_bits to be shrunk when needed. The default value is recorded in a mostly read-only variable so that we can re-initialize when required. Locking had to be updated so that we could protect changes to local_alloc_bits. Mostly this involves protecting various local alloc values with the osb spinlock. A new state is also added, OCFS2_LA_THROTTLED, which is used when the local allocator is has shrunk, but is not disabled. If the available space dips below 1 megabyte, the local alloc file is disabled. In either case, local alloc is re-enabled 30 seconds after the event, or when an appropriate amount of bits is seen in the primary bitmap. Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 13:57:57 -07:00
Mark Fasheh	ebcee4b5c9	ocfs2: Track local alloc bits internally Do this instead of tracking absolute local alloc size. This avoids needless re-calculatiion of bits from bytes in localalloc.c. Additionally, the value is now in a more natural unit for internal file system bitmap work. Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 13:57:57 -07:00
Mark Fasheh	53da4939f3	ocfs2: POSIX file locks support This is actually pretty easy since fs/dlm already handles the bulk of the work. The Ocfs2 userspace cluster stack module already uses fs/dlm as the underlying lock manager, so I only had to add the right calls. Cluster-aware POSIX locks ("plocks") can be turned off by the same means at UNIX locks - mount with 'noflocks', or create a local-only Ocfs2 volume. Internally, the file system uses two sets of file_operations, depending on whether cluster aware plocks is required. This turns out to be easier than implementing local-only versions of ->lock. Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 13:57:57 -07:00
Steven Whitehouse	a447c09324	vfs: Use const for kernel parser table This is a much better version of a previous patch to make the parser tables constant. Rather than changing the typedef, we put the "const" in all the various places where its required, allowing the __initconst exception for nfsroot which was the cause of the previous trouble. This was posted for review some time ago and I believe its been in -mm since then. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Alexander Viro <aviro@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-13 10:10:37 -07:00
Linus Torvalds	20272c8994	Merge branch 'proc' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc * 'proc' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc: proc: remove kernel.maps_protect proc: remove now unneeded ADDBUF macro [PATCH] proc: show personality via /proc/pid/personality [PATCH] signal, procfs: some lock_task_sighand() users do not need rcu_read_lock() proc: move PROC_PAGE_MONITOR to fs/proc/Kconfig proc: make grab_header() static proc: remove unused get_dma_list() proc: remove dummy vmcore_open() proc: proc_sys_root tweak proc: fix return value of proc_reg_open() in "too late" case Fixed up trivial conflict in removed file arch/sparc/include/asm/dma_32.h	2008-10-13 10:04:04 -07:00
Linus Torvalds	8d71ff0bef	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (24 commits) integrity: special fs magic As pointed out by Jonathan Corbet, the timer must be deleted before ERROR: code indent should use tabs where possible The tpm_dev_release function is only called for platform devices, not pnp Protect tpm_chip_list when transversing it. Renames num_open to is_open, as only one process can open the file at a time. Remove the BKL calls from the TPM driver, which were added in the overall netlabel: Add configuration support for local labeling cipso: Add support for native local labeling and fixup mapping names netlabel: Changes to the NetLabel security attributes to allow LSMs to pass full contexts selinux: Cache NetLabel secattrs in the socket's security struct selinux: Set socket NetLabel based on connection endpoint netlabel: Add functionality to set the security attributes of a packet netlabel: Add network address selectors to the NetLabel/LSM domain mapping netlabel: Add a generic way to create ordered linked lists of network addrs netlabel: Replace protocol/NetLabel linking with refrerence counts smack: Fix missing calls to netlbl_skbuff_err() selinux: Fix missing calls to netlbl_skbuff_err() selinux: Fix a problem in security_netlbl_sid_to_secattr() selinux: Better local/forward check in selinux_ip_postroute() ...	2008-10-13 10:00:44 -07:00
Linus Torvalds	244dc4e54b	Merge git://git.infradead.org/users/dwmw2/random-2.6 * git://git.infradead.org/users/dwmw2/random-2.6: Fix autoloading of MacBook Pro backlight driver. Automatic MODULE_ALIAS() for DMI match tables. Remove asm/a.out.h files for all architectures without a.out support. Introduce HAVE_AOUT symbol to remove hard-coded arch list for BINFMT_AOUT Remove redundant CONFIG_ARCH_SUPPORTS_AOUT S390: Update comments about why we don't use <asm-generic/statfs.h> SPARC: Use <asm-generic/statfs.h> PowerPC: Use <asm-generic/statfs.h> PARISC: Use <asm-generic/statfs.h> x86_64: Use <asm-generic/statfs.h> IA64: Use <asm-generic/statfs.h> ARM: Use <asm-generic/statfs.h> Make <asm-generic/statfs.h> suitable for 64-bit platforms. Define and use PCI_DEVICE_ID_MARVELL_88ALP01_CCIC for CAFÉ camera driver [MTD] [NAND] Define and use PCI_DEVICE_ID_MARVELL_88ALP01_NAND for CAFÉ Use PCI_DEVICE_ID_88ALP01 for CAFÉ chip, rather than PCI_DEVICE_ID_CAFE. EFS: Don't set f_fsid in statfs().	2008-10-13 09:59:14 -07:00
Sukadev Bhattiprolu	a6f37daa8b	Simplify devpts_pty_kill When creating a new pty, save the pty's inode in the tty->driver_data. Use this inode in pty_kill() to identify the devpts instance. Since we now have the inode for the pty, we can skip get_node() lookup and remove the unused get_node(). TODO: - check if the mutex_lock is needed in pty_kill(). Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-13 09:51:43 -07:00
Sukadev Bhattiprolu	89a52e109e	Simplify devpts_pty_new() devpts_pty_new() is called when setting up a new pty and would not will not have an existing dentry or inode for the pty. So don't bother looking for an existing dentry - just create a new one. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-13 09:51:43 -07:00
Sukadev Bhattiprolu	527b3e4773	Simplify devpts_get_tty() As pointed out by H. Peter Anvin, since the inode for the pty is known, we don't need to look it up. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-13 09:51:43 -07:00
Sukadev Bhattiprolu	15f1a6338d	Add an instance parameter devpts interfaces Pass-in 'inode' or 'tty' parameter to devpts interfaces. With multiple devpts instances, these parameters will be used in subsequent patches to identify the instance of devpts mounted. The parameters also help simplify devpts implementation. Changelog[v3]: - minor changes due to merge with ttydev updates - rename parameters to emphasize they are ptmx or pts inodes - pass-in tty_struct * to devpts_pty_kill() (this will help cleanup the get_node() call in a subsequent patch) Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-13 09:51:43 -07:00
Alan Cox	934e6ebf96	tty: Redo current tty locking Currently it is sometimes locked by the tty mutex and sometimes by the sighand lock. The latter is in fact correct and now we can hand back referenced objects we can fix this up without problems around sleeping functions. Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-13 09:51:41 -07:00
Alan Cox	2cb5998b5f	tty: the vhangup syscall is racy We now have the infrastructure to sort this out but rather than teaching the syscall tty lock rules we move the hard work into a tty helper Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-13 09:51:41 -07:00
Alan Cox	452a00d2ee	tty: Make get_current_tty use a kref We now return a kref covered tty reference. That ensures the tty structure doesn't go away when you have a return from get_current_tty. This is not enough to protect you from most of the resources being freed behind your back - yet. [Updated to include fixes for SELinux problems found by Andrew Morton and an s390 leak found while debugging the former] Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-13 09:51:41 -07:00
David Woodhouse	e758936e02	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 Conflicts: include/asm-x86/statfs.h	2008-10-13 17:13:56 +01:00
Linus Torvalds	3280fb3139	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix kconfig typo and extra whitespace ext4: fix build failure without procfs ext4: add an option to control error handling on file data jbd2: don't dirty original metadata buffer on abort ext4: add checks for errors from jbd2 jbd2: fix error handling for checkpoint io jbd2: abort when failed to log metadata buffers	2008-10-12 16:10:29 -07:00
Mimi Zohar	9256292782	integrity: special fs magic Discussion on the mailing list questioned the use of these magic values in userspace, concluding these values are already exported to userspace via statfs and their correct/incorrect usage is left up to the userspace application. - Move special fs magic number definitions to magic.h - Add magic.h include Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Reviewed-by: James Morris <jmorris@namei.org> Signed-off-by: James Morris <jmorris@namei.org>	2008-10-13 09:47:43 +11:00
Jan Engelhardt	f319fb8bf6	ext4: fix kconfig typo and extra whitespace Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-12 15:53:01 -04:00
Alexander Beregalov	3244fcb1ae	ext4: fix build failure without procfs fs/ext4/super.c: In function 'ext4_fill_super': fs/ext4/super.c:2226: error: 'ext4_ui_proc_fops' undeclared (first use in this function) fs/ext4/super.c:2226: error: (Each undeclared identifier is reported only once fs/ext4/super.c:2226: error: for each function it appears in.) Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-12 17:27:49 -04:00
Linus Torvalds	f1b2a5ace9	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: [CIFS] cifs: remove pointless lock and unlock of GlobalMid_Lock in header_assemble	2008-10-12 12:42:36 -07:00
Adrian Bunk	06270d5d6a	provide generic_block_fiemap() only with BLOCK=y This fixes the following compile error with CONFIG_BLOCK=n caused by commit `68c9d702bb` ("generic block based fiemap implementation"): CC fs/ioctl.o fs/ioctl.c: In function 'generic_block_fiemap': fs/ioctl.c:249: error: storage size of 'tmp' isn't known fs/ioctl.c:272: error: invalid application of 'sizeof' to incomplete type 'struct buffer_head' fs/ioctl.c:280: error: implicit declaration of function 'buffer_mapped' fs/ioctl.c:249: warning: unused variable 'tmp' make[2]: *** [fs/ioctl.o] Error 1 Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-12 11:44:37 -07:00
Jeff Layton	14835a3325	[CIFS] cifs: remove pointless lock and unlock of GlobalMid_Lock in header_assemble We lock GlobalMid_Lock in header_assemble and then immediately unlock it again without doing anything. Not sure what this was intended to do, but remove it. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-10-12 13:34:11 +00:00
Linus Torvalds	fd04808830	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (43 commits) ext4: Rename ext4dev to ext4 ext4: Avoid double dirtying of super block in ext4_put_super() Update ext4 MAINTAINERS file Hook ext4 to the vfs fiemap interface. generic block based fiemap implementation ocfs2: fiemap support vfs: vfs-level fiemap interface ext4: fix xattr deadlock jbd2: Fix buffer head leak when writing the commit block ext4: Add debugging markers that can be used by systemtap jbd2: abort instead of waiting for nonexistent transaction ext4: fix initialization of UNINIT bitmap blocks ext4: Remove old legacy block allocator ext4: Use readahead when reading an inode from the inode table ext4: Improve the documentation for ext4's /proc tunables ext4: Combine proc file handling into a single set of functions ext4: move /proc setup and teardown out of mballoc.c ext4: Don't use 'struct dentry' for internal lookups ext4/jbd2: Avoid WARN() messages when failing to write to the superblock ext4: use percpu data structures for lg_prealloc_list ...	2008-10-11 13:23:48 -07:00
Linus Torvalds	86ed5a93b8	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: [CIFS] Check that last search entry resume key is valid [CIFS] make sure we have the right resume info before calling CIFSFindNext [CIFS] clean up error handling in cifs_unlink [CIFS] fix some settings of cifsAttrs after calling SetFileInfo and SetPathInfo cifs: explicitly revoke SPNEGO key after session setup cifs: Convert cifs to new aops. [CIFS] update DOS attributes in cifsInode if we successfully changed them cifs: remove NULL termination from rename target in CIFSSMBRenameOpenFIle cifs: work around samba returning -ENOENT on SetFileDisposition call cifs: fix inverted NULL check after kmalloc [CIFS] clean up upcall handling for dns_resolver keys [CIFS] fix busy-file renames and refactor cifs_rename logic cifs: add function to set file disposition [CIFS] add constants for string lengths of keynames in SPNEGO upcall string cifs: move rename and delete-on-close logic into helper function cifs: have find_writeable_file prefer filehandles opened by same task cifs: don't use GFP_KERNEL with GFP_NOFS [CIFS] use common code for turning off ATTR_READONLY in cifs_unlink cifs: clean up variables in cifs_unlink	2008-10-11 09:31:53 -07:00
Hidehiro Kawai	5bf5683a33	ext4: add an option to control error handling on file data If the journal doesn't abort when it gets an IO error in file data blocks, the file data corruption will spread silently. Because most of applications and commands do buffered writes without fsync(), they don't notice the IO error. It's scary for mission critical systems. On the other hand, if the journal aborts whenever it gets an IO error in file data blocks, the system will easily become inoperable. So this patch introduces a filesystem option to determine whether it aborts the journal or just call printk() when it gets an IO error in file data. If you mount an ext4 fs with data_err=abort option, it aborts on file data write error. If you mount it with data_err=ignore, it doesn't abort, just call printk(). data_err=ignore is the default. Here is the corresponding patch of the ext3 version: http://kerneltrap.org/mailarchive/linux-kernel/2008/9/9/3239374 Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-10 22:12:43 -04:00
Hidehiro Kawai	7ad7445f60	jbd2: don't dirty original metadata buffer on abort Currently, original metadata buffers are dirtied when they are unfiled whether the journal has aborted or not. Eventually these buffers will be written-back to the filesystem by pdflush. This means some metadata buffers are written to the filesystem without journaling if the journal aborts. So if both journal abort and system crash happen at the same time, the filesystem would become inconsistent state. Additionally, replaying journaled metadata can overwrite the latest metadata on the filesystem partly. Because, if the journal gets aborted, journaled metadata are preserved and replayed during the next mount not to lose uncheckpointed metadata. This would also break the consistency of the filesystem. This patch prevents original metadata buffers from being dirtied on abort by clearing BH_JBDDirty flag from those buffers. Thus, no metadata buffers are written to the filesystem without journaling. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-10 20:29:31 -04:00
Hidehiro Kawai	7ffe1ea894	ext4: add checks for errors from jbd2 If the journal has aborted due to a checkpointing failure, we have to keep the contents of the journal space. Otherwise, the filesystem will lose uncheckpointed metadata completely and become inconsistent. To avoid this, we need to keep needs_recovery flag if checkpoint has failed. With this patch, ext4_put_super() detects a checkpointing failure from the return value of journal_destroy(), then it invokes ext4_abort() to make the filesystem read only and keep needs_recovery flag. Errors from jbd2_journal_flush() are also handled by this patch in some places. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-10 20:29:21 -04:00
Hidehiro Kawai	44519faf22	jbd2: fix error handling for checkpoint io When a checkpointing IO fails, current JBD2 code doesn't check the error and continue journaling. This means latest metadata can be lost from both the journal and filesystem. This patch leaves the failed metadata blocks in the journal space and aborts journaling in the case of jbd2_log_do_checkpoint(). To achieve this, we need to do: 1. don't remove the failed buffer from the checkpoint list where in the case of __try_to_free_cp_buf() because it may be released or overwritten by a later transaction 2. jbd2_log_do_checkpoint() is the last chance, remove the failed buffer from the checkpoint list and abort the journal 3. when checkpointing fails, don't update the journal super block to prevent the journaled contents from being cleaned. For safety, don't update j_tail and j_tail_sequence either 4. when checkpointing fails, notify this error to the ext4 layer so that ext4 don't clear the needs_recovery flag, otherwise the journaled contents are ignored and cleaned in the recovery phase 5. if the recovery fails, keep the needs_recovery flag 6. prevent jbd2_cleanup_journal_tail() from being called between __jbd2_journal_drop_transaction() and jbd2_journal_abort() (a possible race issue between jbd2_log_do_checkpoint()s called by jbd2_journal_flush() and __jbd2_log_wait_for_space()) Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-10 20:29:13 -04:00
Hidehiro Kawai	77e841de8a	jbd2: abort when failed to log metadata buffers If we failed to write metadata buffers to the journal space and succeeded to write the commit record, stale data can be written back to the filesystem as metadata in the recovery phase. To avoid this, when we failed to write out metadata buffers, abort the journal before writing the commit record. We can also avoid this kind of corruption by using the journal checksum feature because it can detect invalid metadata blocks in the journal and avoid them from being replayed. So we don't need to care about asynchronous commit record writeout with a checksum. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-12 16:39:16 -04:00
Aneesh Kumar K.V	a1aebc1e2d	ext4: Don't reuse released data blocks until transaction commits We need to make sure we don't reuse the data blocks released during the transaction untill the transaction commits. We force this mode only for ordered and journalled mode. Writeback mode already don't provided data consistency. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-10 20:13:31 -04:00
Aneesh Kumar K.V	c894058d66	ext4: Use an rbtree for tracking blocks freed during transaction. With this patch we track the block freed during a transaction using red-black tree. We also make sure contiguous blocks freed are collected in one node in the tree. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-16 10:14:27 -04:00
Aneesh Kumar K.V	c2774d84fd	ext4: Do mballoc init before doing filesystem recovery During filesystem recovery we may be doing a truncate which expects some of the mballoc data structures to be initialized. So do ext4_mb_init before recovery. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-10 20:07:20 -04:00
Aneesh Kumar K.V	688f05a019	ext4: Free ext4_prealloc_space using kmem_cache_free We should use kmem_cache_free to free memory allocated via kmem_cache_alloc Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-13 12:14:14 -04:00
Manish Katiyar	473dc8eddb	ext4: Fix Kconfig typo for ext4dev Looks like there is one more instance where ext4dev should be changed to ext4 because the module name will be "ext4" unless EXT4DEV_COMPAT is selected. Signed-off-by: Manish Katiyar <mkatiyar@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-13 09:01:02 -04:00
Martin Michlmayr	cab893d909	ext4: Remove an old reference to ext4dev in Makefile comment Remove an old reference to ext4dev. Signed-off-by: Martin Michlmayr <tbm@cyrius.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-17 15:03:38 -04:00
Theodore Ts'o	03010a3350	ext4: Rename ext4dev to ext4 The ext4 filesystem is getting stable enough that it's time to drop the "dev" prefix. Also remove the requirement for the TEST_FILESYS flag. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-10 20:02:48 -04:00
Chuck Lever	5e2e7721f0	NFS: fix nfs_parse_ip_address() corner case Bruce observed that nfs_parse_ip_address() will successfully parse an IPv6 address that looks like this: "::1%" A scope delimiter is present, but there is no scope ID following it. This is harmless, as it would simply set the scope ID to zero. However, in some cases we would like to flag this as an improperly formed address. We are now also careful to reject addresses where garbage follows the address (up to the length of the string), instead of ignoring the non-address characters; and where the scope ID is nonsense (not a valid device name, but also not numeric). Before, both of these cases would result in a harmless zero scope ID. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-10 14:41:51 -04:00
J. Bruce Fields	456018d791	NFS: Cleanup nfs_set_port Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-10 14:41:50 -04:00
Linus Torvalds	13dd7f876d	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm: dlm: choose better identifiers dlm: remove bkl dlm: fix address compare dlm: fix locking of lockspace list in dlm_scand dlm: detect available userspace daemon dlm: allow multiple lockspace creates	2008-10-10 11:13:55 -07:00
Christoph Hellwig	73f6aa4d44	Fix barrier fail detection in XFS Currently we disable barriers as soon as we get a buffer in xlog_iodone that has the XBF_ORDERED flag cleared. But this can be the case not only for buffers where the barrier failed, but also the first buffer of a split log write in case of a log wraparound. Due to the disabled barriers we can easily get directory corruption on unclean shutdowns. So instead of using this check add a new buffer flag for failed barrier writes. This is a regression vs 2.6.26 caused by patch to use the right macro to check for the ORDERED flag, as we previously got true returned for every buffer. Thanks to Toei Rei for reporting the bug. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: David Chinner <david@fromorbit.com> Signed-off-by: Tim Shimmin <tes@sgi.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-10 11:08:07 -07:00
Linus Torvalds	445e1ceda3	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: GFS2: Support for I/O barriers GFS2: Add UUID to GFS2 sb GFS2: high time to take some time over atime GFS2: The war on bloat GFS2: GFS2 will panic if you misspell any mount options GFS2: Direct IO write at end of file error GFS2: Use an IS_ERR test rather than a NULL test GFS2: Fix race relating to glock min-hold time GFS2: Fix & clean up GFS2 rename GFS2: rm on multiple nodes causes panic GFS2: Fix metafs mounts GFS2: Fix debugfs glock file iterator	2008-10-10 11:02:22 -07:00
Linus Torvalds	e26feff647	Merge branch 'for-2.6.28' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.28' of git://git.kernel.dk/linux-2.6-block: (132 commits) doc/cdrom: Trvial documentation error, file not present block_dev: fix kernel-doc in new functions block: add some comments around the bio read-write flags block: mark bio_split_pool static block: Find bio sector offset given idx and offset block: gendisk integrity wrapper block: Switch blk_integrity_compare from bdev to gendisk block: Fix double put in blk_integrity_unregister block: Introduce integrity data ownership flag block: revert part of d7533ad0e132f92e75c1b2eb7c26387b25a583c1 bio.h: Remove unused conditional code block: remove end_{queued\|dequeued}_request() block: change elevator to use __blk_end_request() gdrom: change to use __blk_end_request() memstick: change to use __blk_end_request() virtio_blk: change to use __blk_end_request() blktrace: use BLKTRACE_BDEV_SIZE as the name size for setup structure block: add lld busy state exporting interface block: Fix blk_start_queueing() to not kick a stopped queue include blktrace_api.h in headers_install ...	2008-10-10 10:52:45 -07:00
Alexey Dobriyan	3bbfe05967	proc: remove kernel.maps_protect After commit `831830b5a2` aka "restrict reading from /proc/<pid>/maps to those who share ->mm or can ptrace" sysctl stopped being relevant because commit moved security checks from ->show time to ->start time (mm_for_maps()). Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Kees Cook <kees.cook@canonical.com>	2008-10-10 04:24:51 +04:00
Alexey Dobriyan	45acb8db06	proc: remove now unneeded ADDBUF macro After local seq_file conversion it was forgotten. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:58 +04:00
Kees Cook	4783072308	[PATCH] proc: show personality via /proc/pid/personality Make process personality flags visible in /proc. Since a process's personality is potentially sensitive (e.g. READ_IMPLIES_EXEC), make this file only readable by the process owner. Signed-off-by: Kees Cook <kees.cook@canonical.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:57 +04:00
Lai Jiangshan	a6bebbc87a	[PATCH] signal, procfs: some lock_task_sighand() users do not need rcu_read_lock() lock_task_sighand() make sure task->sighand is being protected, so we do not need rcu_read_lock(). [ exec() will get task->sighand->siglock before change task->sighand! ] But code using rcu_read_lock() _just_ to protect lock_task_sighand() only appear in procfs. (and some code in procfs use lock_task_sighand() without such redundant protection.) Other subsystem may put lock_task_sighand() into rcu_read_lock() critical region, but these rcu_read_lock() are used for protecting "for_each_process()", "find_task_by_vpid()" etc. , not for protecting lock_task_sighand(). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> [ok from Oleg] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:57 +04:00
Alexey Dobriyan	53167a3ef2	proc: move PROC_PAGE_MONITOR to fs/proc/Kconfig Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:57 +04:00
Adrian Bunk	81324364b7	proc: make grab_header() static Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:56 +04:00
Alexey Dobriyan	a70973c214	proc: remove unused get_dma_list() Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:56 +04:00
Alexey Dobriyan	a04f4de641	proc: remove dummy vmcore_open() Empty ->open is equivalent to always succeeding ->open. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:55 +04:00
Alexey Dobriyan	e1675231ce	proc: proc_sys_root tweak Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:55 +04:00
Alexey Dobriyan	300b994b74	proc: fix return value of proc_reg_open() in "too late" case If ->open() wasn't called, returning 0 is misleading and, theoretically, oopsable: 1) remove_proc_entry clears ->proc_fops, drops lock, 2) ->open "succeeds", 3) ->release oopses, because it assumes ->open was called (single_release()). Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:54 +04:00
Linus Torvalds	efc968d450	Don't allow splice() to files opened with O_APPEND This is debatable, but while we're debating it, let's disallow the combination of splice and an O_APPEND destination. It's not entirely clear what the semantics of O_APPEND should be, and POSIX apparently expects pwrite() to ignore O_APPEND, for example. So we could make up any semantics we want, including the old ones. But Miklos convinced me that we should at least give it some thought, and that accepting writes at arbitrary offsets is wrong at least for IS_APPEND() files (which always have O_APPEND set, even if the reverse isn't true: you can obviously have O_APPEND set on a regular file). So disallow O_APPEND entirely for now. I doubt anybody cares, and this way we have one less gray area to worry about. Reported-and-argued-for-by: Miklos Szeredi <miklos@szeredi.hu> Acked-by: Jens Axboe <ens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-09 14:26:38 -07:00
Trond Myklebust	03254e65a6	NFS: Fix attribute updates This fixes a regression seen when running the Connectathon testsuite against an ext3 filesystem. The reason was that the inode was constantly being marked as 'just updated' by the jiffy wraparound test. This again meant that newer GETATTR calls were failing to pass the nfs_inode_attrs_need_update() test unless the changes caused a ctime update on the server, since they were perceived as having been started before the latest inode update. Given that nfs_inode_attrs_need_update() already checks for wraparound of nfsi->last_updated, we can drop the buggy "protection" in nfs_update_inode(). Also make a slight micro-optimisation of nfs_inode_attrs_need_update(): we are more often going to see time_after(fattr->time_start, nfsi->last_updated) be true, rather than seeing an update of ctime/size, so put that test first to ensure that we optimise away the ctime/size tests. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-09 13:34:07 -04:00
Randy Dunlap	57d1b5366f	block_dev: fix kernel-doc in new functions Fix kernel-doc in new functions: Error(mmotm-2008-1002-1617//fs/block_dev.c:895): duplicate section name 'Description' Error(mmotm-2008-1002-1617//fs/block_dev.c:924): duplicate section name 'Description' Warning(mmotm-2008-1002-1617//fs/block_dev.c:1282): No description found for parameter 'pathname' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> cc: Andrew Patterson <andrew.patterson@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 10:42:38 +02:00
Denis ChengRq	6feef531f5	block: mark bio_split_pool static Since all bio_split calls refer the same single bio_split_pool, the bio_split function can use bio_split_pool directly instead of the mempool_t parameter; then the mempool_t parameter can be removed from bio_split param list, and bio_split_pool is only referred in fs/bio.c file, can be marked static. Signed-off-by: Denis ChengRq <crquan@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:57:05 +02:00
Martin K. Petersen	ad3316bf4e	block: Find bio sector offset given idx and offset Helper function to find the sector offset in a bio given bvec index and page offset. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:22 +02:00
Martin K. Petersen	74aa8c2cc0	block: Introduce integrity data ownership flag A filesystem might supply its own integrity metadata. Introduce a flag that indicates whether the filesystem or the block layer owns the integrity buffer. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:21 +02:00
Jens Axboe	b04accc425	block: revert part of d7533ad0e132f92e75c1b2eb7c26387b25a583c1 We need bdev_get_integrity() to support the pending md/dm patches. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:21 +02:00
Jens Axboe	9c02f2b02e	block: cleanup some of the integrity stuff in blkdev.h Don't put functions that are only used in fs/bio-integrity.c in blkdev.h, it's much cleaner to just keep it in there. Also kill completely unused bdev_get_tag_size() Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:17 +02:00
Jens Axboe	0a0d96b03a	block: add bio_kmalloc() Not all callers need (or want!) the mempool backing guarentee, it essentially means that you can only use bio_alloc() for short allocations and not for preallocating some bio's at setup or init time. So add bio_kmalloc() which does the same thing as bio_alloc(), except it just uses kmalloc() as the backing instead of the bio mempools. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:17 +02:00
Andrew Patterson	608aeef17a	Call flush_disk() after detecting an online resize. We call flush_disk() to make sure the buffer cache for the disk is flushed after a disk resize. There are two resize cases, growing and shrinking. Given that users can shrink/then grow a disk before revalidate_disk() is called, we treat the grow case identically to shrinking. We need to flush the buffer cache after an online shrink because, as James Bottomley puts it, The two use cases for shrinking I can see are 1. planned: the fs is already shrunk to within the new boundaries and all data is relocated, so invalidate is fine (any dirty buffers that might exist in the shrunk region are there only because they were relocated but not yet written to their original location). 2. unplanned: In this case, the fs is probably toast, so whether we invalidate or not isn't going to make a whole lot of difference; it's still going to try to read or write from sectors beyond the new size and get I/O errors. Immediately invalidating shrunk disks will cause errors for outstanding I/Os for reads/write beyond the new end of the disk to be generated earlier then if we waited for the normal buffer cache operation. It also removes a potential security hole where we might keep old data around from beyond the end of the shrunk disk if the disk was not invalidated. Signed-off-by: Andrew Patterson <andrew.patterson@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:13 +02:00
Andrew Patterson	56ade44b46	Added flush_disk to factor out common buffer cache flushing code. We need to be able to flush the buffer cache for for more than just when a disk is changed, so we factor out common cache flush code in check_disk_change() to an internal flush_disk() routine. This routine will then be used for both disk changes and disk resizes (in a later patch). Include the disk name in the text indicating that there are busy inodes on the device and increase the KERN severity of the message. Signed-off-by: Andrew Patterson <andrew.patterson@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:13 +02:00
Andrew Patterson	9bc3ffbfbd	Check for device resize when rescanning partitions Check for device resize in the rescan_partitions() routine. If the device has been resized, the bdev size is set to match. The rescan_partitions() routine is called when opening the device and when calling the BLKRRPART ioctl. Signed-off-by: Andrew Patterson <andrew.patterson@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:12 +02:00
Andrew Patterson	c3279d1454	Adjust block device size after an online resize of a disk. The revalidate_disk routine now checks if a disk has been resized by comparing the gendisk capacity to the bdev inode size. If they are different (usually because the disk has been resized underneath the kernel) the bdev inode size is adjusted to match the capacity. Signed-off-by: Andrew Patterson <andrew.patterson@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:12 +02:00
Andrew Patterson	0c002c2f74	Wrapper for lower-level revalidate_disk routines. This is a wrapper for the lower-level revalidate_disk call-backs such as sd_revalidate_disk(). It allows us to perform pre and post operations when calling them. We will use this wrapper in a later patch to adjust block device sizes after an online resize (a _post_ operation). Signed-off-by: Andrew Patterson <andrew.patterson@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:12 +02:00
FUJITA Tomonori	818827669d	block: make blk_rq_map_user take a NULL user-space buffer This patch changes blk_rq_map_user to accept a NULL user-space buffer with a READ command if rq_map_data is not NULL. Thus a caller can pass page frames to lk_rq_map_user to just set up a request and bios with page frames propely. bio_uncopy_user (called via blk_rq_unmap_user) doesn't copy data to user space with such request. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:11 +02:00
FUJITA Tomonori	4d8ab62e08	bio: convert bio_copy_kern to use bio_copy_user bio_copy_kern and bio_copy_user are very similar. This converts bio_copy_kern to use bio_copy_user. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:10 +02:00
FUJITA Tomonori	152e283fdf	block: introduce struct rq_map_data to use reserved pages This patch introduces struct rq_map_data to enable bio_copy_use_iov() use reserved pages. Currently, bio_copy_user_iov allocates bounce pages but drivers/scsi/sg.c wants to allocate pages by itself and use them. struct rq_map_data can be used to pass allocated pages to bio_copy_user_iov. The current users of bio_copy_user_iov simply passes NULL (they don't want to use pre-allocated pages). Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Douglas Gilbert <dougg@torque.net> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:10 +02:00
FUJITA Tomonori	a3bce90edd	block: add gfp_mask argument to blk_rq_map_user and blk_rq_map_user_iov Currently, blk_rq_map_user and blk_rq_map_user_iov always do GFP_KERNEL allocation. This adds gfp_mask argument to blk_rq_map_user and blk_rq_map_user_iov so sg can use it (sg always does GFP_ATOMIC allocation). Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Douglas Gilbert <dougg@torque.net> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:10 +02:00
Jens Axboe	c7c22e4d5c	block: add support for IO CPU affinity This patch adds support for controlling the IO completion CPU of either all requests on a queue, or on a per-request basis. We export a sysfs variable (rq_affinity) which, if set, migrates completions of requests to the CPU that originally submitted it. A bio helper (bio_set_completion_cpu()) is also added, so that queuers can ask for completion on that specific CPU. In testing, this has been show to cut the system time by as much as 20-40% on synthetic workloads where CPU affinity is desired. This requires a little help from the architecture, so it'll only work as designed for archs that are using the new generic smp helper infrastructure. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:09 +02:00
Tejun Heo	3e1a7ff8a0	block: allow disk to have extended device number Now that disk and partition handlings are mostly unified, it's easy to allow disk to have extended device number. This patch makes add_disk() use extended device number if disk->minors is zero. Both sd and ide-disk are updated to use this. * sd_format_disk_name() is implemented which can generically determine the drive name. This removes disk number restriction stemming from limited device names. * If sd index goes over SD_MAX_DISKS (which can be increased now BTW), sd simply doesn't initialize minors letting block layer choose extended device number. * If CONFIG_DEBUG_EXT_DEVT is set, both sd and ide-disk always set minors to 0 and use extended device numbers. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	689d6fac40	block: replace @ext_minors with GENHD_FL_EXT_DEVT With previous changes, it's meaningless to limit the number of partitions. Replace @ext_minors with GENHD_FL_EXT_DEVT such that setting the flag allows the disk to have maximum number of allowed partitions (only limited by the number of entries in parsed_partitions as determined by MAX_PART constant). This kills not-too-pretty alloc_disk_ext[_node]() functions and makes @minors parameter to alloc_disk[_node]() unnecessary. The parameter is left alone to avoid disturbing the users. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	540eed5637	block: make partition array dynamic disk->__part used to be statically allocated to the maximum possible number of partitions. This patch makes partition array allocation dynamic. The added overhead is minimal as only real change is one memory dereference changed to RCU one. This saves both a bit of memory and cpu cycles iterating through unoccupied slots and makes increasing partition limit easier. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	074a7aca7a	block: move stats from disk to part0 Move stats related fields - stamp, in_flight, dkstats - from disk to part0 and unify stat handling such that... * part_stat_() now updates part0 together if the specified partition is not part0. ie. part_stat_() are now essentially all_stat_(). {disk\|all}_stat_() are gone. part_round_stats() is updated similary. It handles part0 stats automatically and disk_round_stats() is killed. * part_{inc\|dec}_in_fligh() is implemented which automatically updates part0 stats for parts other than part0. * disk_map_sector_rcu() is updated to return part0 if no part matches. Combined with the above changes, this makes NULL special case handling in callers unnecessary. * Separate stats show code paths for disk are collapsed into part stats show code paths. * Rename disk_stat_lock/unlock() to part_stat_lock/unlock() While at it, reposition stat handling macros a bit and add missing parentheses around macro parameters. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	eddb2e26b5	block: kill GENHD_FL_FAIL and use part0->make_it_fail GENHD_FL_FAIL for disk is what make_it_fail is for parts. Kill it and use part0->make_it_fail. Sysfs node handling is unified too. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	0762b8bde9	block: always set bdev->bd_part Till now, bdev->bd_part is set only if the bdev was for parts other than part0. This patch makes bdev->bd_part always set so that code paths don't have to differenciate common handling. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	4c46501d16	block: move holder_dir from disk to part0 Move disk->holder_dir to part0->holder_dir. Kill now mostly superflous bdev_get_holder(). While at it, kill superflous kobject_get/put() around holder_dir, slave_dir and cmd_filter creation and collapse disk_sysfs_add_subdirs() into register_disk(). These serve no purpose but obfuscating the code. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	b7db9956e5	block: move policy from disk to part0 Move disk->policy to part0->policy. Implement and use get_disk_ro(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:07 +02:00
Tejun Heo	e561052149	block: unify sysfs size node handling Now that capacity and __dev are moved to part0, part0 and others can share the same method. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:07 +02:00
Tejun Heo	80795aefb7	block: move capacity from disk to part0 Move disk->capacity to part0->nr_sects and convert all users who directly accessed the field to use {get\|set}_capacity(). This is done early to allow the __dev field to be moved. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:07 +02:00
Tejun Heo	b5d0b9df0b	block: introduce partition 0 genhd and partition code handled disk and partitions separately. All information about the whole disk was in struct genhd and partitions in struct hd_struct. However, the whole disk (part0) and other partitions have a lot in common and the data structures end up having good number of common fields and thus separate code paths doing the same thing. Also, the partition array was indexed by partno - 1 which gets pretty confusing at times. This patch introduces partition 0 and makes the partition array indexed by partno. Following patches will unify the handling of disk and parts piece-by-piece. This patch also implements disk_partitionable() which tests whether a disk is partitionable. With coming dynamic partition array change, the most common usage of disk_max_parts() will be testing whether a disk is partitionable and the number of max partitions will become much less important. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:07 +02:00
Tejun Heo	ed9e198234	block: implement and use {disk\|part}_to_dev() Implement {disk\|part}_to_dev() and use them to access generic device instead of directly dereferencing {disk\|part}->dev. To make sure no user is left behind, rename generic devices fields to __dev. This is in preparation of unifying partition 0 handling with other partitions. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:07 +02:00
Tejun Heo	bcce3de1be	block: implement extended dev numbers Implement extended device numbers. A block driver can tell block layer that it wants to use extended device numbers. After the usual minor space is used up, block layer automatically allocates devt's from EXT_BLOCK_MAJOR. Currently only one major number is allocated for this but as the allocation is strictly on-demand, ~1mil minor space under it should suffice unless the system actually has more than ~1mil partitions and if that ever happens adding more majors to the extended devt area is easy. Due to internal implementation issues, the first partition can't be allocated on the extended area. In other words, genhd->minors should at least be 1. This limitation will be lifted by later changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:06 +02:00
Tejun Heo	c995905916	block: fix diskstats access There are two variants of stat functions - ones prefixed with double underbars which don't care about preemption and ones without which disable preemption before manipulating per-cpu counters. It's unclear whether the underbarred ones assume that preemtion is disabled on entry as some callers don't do that. This patch unifies diskstats access by implementing disk_stat_lock() and disk_stat_unlock() which take care of both RCU (for partition access) and preemption (for per-cpu counter access). diskstats access should always be enclosed between the two functions. As such, there's no need for the versions which disables preemption. They're removed and double underbars ones are renamed to drop the underbars. As an extra argument is added, there's no danger of using the old version unconverted. disk_stat_lock() uses get_cpu() and returns the cpu index and all diskstat functions which access per-cpu counters now has @cpu argument to help RT. This change adds RCU or preemption operations at some places but also collapses several preemption ops into one at others. Overall, the performance difference should be negligible as all involved ops are very lightweight per-cpu ones. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:06 +02:00
Tejun Heo	e71bf0d0ee	block: fix disk->part[] dereferencing race disk->part[] is protected by its matching bdev's lock. However, non-critical accesses like collecting stats and printing out sysfs and proc information used to be performed without any locking. As partitions can come and go dynamically, partitions can go away underneath those non-critical accesses. As some of those accesses are writes, this theoretically can lead to silent corruption. This patch fixes the race by using RCU for the partition array and dev reference counter to hold partitions. * Rename disk->part[] to disk->__part[] to make sure no one outside genhd layer proper accesses it directly. * Use RCU for disk->__part[] dereferencing. * Implement disk_{get\|put}_part() which can be used to get and put partitions from gendisk respectively. * Iterators are implemented to help iterate through all partitions safely. * Functions which require RCU readlock are marked with _rcu suffix. * Use disk_put_part() in __blkdev_put() instead of directly putting the contained kobject. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:06 +02:00
Tejun Heo	f331c0296f	block: don't depend on consecutive minor space * Implement disk_devt() and part_devt() and use them to directly access devt instead of computing it from ->major and ->first_minor. Note that all references to ->major and ->first_minor outside of block layer is used to determine devt of the disk (the part0) and as ->major and ->first_minor will continue to represent devt for the disk, converting these users aren't strictly necessary. However, convert them for consistency. * Implement disk_max_parts() to avoid directly deferencing genhd->minors. * Update bdget_disk() such that it doesn't assume consecutive minor space. * Move devt computation from register_disk() to add_disk() and make it the only one (all other usages use the initially determined value). These changes clean up the code and will help disk->part dereference fix and extended block device numbers. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:05 +02:00
Tejun Heo	cf771cb5a7	block: make variable and argument names more consistent In hd_struct, @partno is used to denote partition number and a number of other places use @part to denote hd_struct. Functions use @part and @index instead. This causes confusion and makes it difficult to use consistent variable names for hd_struct. Always use @partno if a variable represents partition number. Also, print out functions use @f or @part for seq_file argument. Use @seqf uniformly instead. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:05 +02:00
Tejun Heo	88e341261c	block: update add_partition() error handling `d805dda4` tried to fix error case handling in add_partition() but had a few problems. * disk->part[] entry is set early and left dangling if operation fails. * Once device initialized, the last put_device() is responsible for freeing all the resources. The failure path freed part_stats and p regardless of put_device() causing double free. * holders subdir holds reference to the disk device, so failure path should remove it to release resources properly which was missing. This patch fixes the above problems and while at it move partition slot busy check into add_partition() for completeness and inlines holders subdirectory creation. Using separate function for it just obfuscates the code. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Abdel Benamrouche <draconux@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:04 +02:00
Tejun Heo	ec2cdedf79	block: allow deleting zero length partition delete_partition() was noop for zero length partition. As the addition code allows creating zero lenght partition and deletion is assumed to always succeed, this causes memory leak for zero length partitions. Allow zero length partitions to end their meaningless lives. While at it, allow deleting zero lenght partition via BLKPG_DEL_PARTITION ioctl too. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:04 +02:00
Mikulas Patocka	5df97b91b5	drop vmerge accounting Remove hw_segments field from struct bio and struct request. Without virtual merge accounting they have no purpose. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:03 +02:00
Mikulas Patocka	b8b3e16cfe	block: drop virtual merging accounting Remove virtual merge accounting. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:03 +02:00
David Woodhouse	8c540a96c1	Let the block device know when sectors can be discarded [hirofumi@mail.parknet.co.jp: discard _after_ checking for corrupt chains] Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:01 +02:00
Steve French	b77d753c41	[CIFS] Check that last search entry resume key is valid Jeff's recent patch to add a last_entry field in the search structure to better construct resume keys did not validate that the server sent us a plausible pointer to the last entry. This adds that. Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-10-08 19:13:46 +00:00
Trond Myklebust	d7fb120774	NFS: Don't use range_cyclic for data integrity syncs It is more efficient to write linearly starting from the beginning of the file. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-07 18:19:05 -04:00
Steve Dickson	8491945f11	NFS: Client mounts hang when exported directory do not exist This patch fixes a regression that was introduced by the string based mounts. nfs_mount() statically returns -EACCES for every error returned by the remote mounted. This is incorrect because -EACCES is an non-fatal error to the mount.nfs command. This error causes mount.nfs to retry the mount even in the case when the exported directory does not exist. This patch maps the errors returned by the remote mountd into valid errno values, exactly how it was done pre-string based mounts. By returning the correct errno enables mount.nfs to do the right thing. Signed-off-by: Steve Dickson <steved@redhat.com> [Trond.Myklebust@netapp.com: nfs_stat_to_errno() now correctly returns negative errors, so remove the sign change.] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-07 18:19:01 -04:00
J. Bruce Fields	ea31a4437c	nfs: Fix misparsing of nfsv4 fs_locations attribute The code incorrectly assumes here that the server name (or ip address) is null-terminated. This can cause referrals to fail in some cases. Also support ipv6 addresses. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-07 18:17:47 -04:00
J. Bruce Fields	f0c929251e	nfs: prepare to share nfs_set_port We plan to use this function elsewhere. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-07 18:17:36 -04:00
J. Bruce Fields	460cdbc832	nfs: replace while loop by for loops in nfs_follow_referral Whoever wrote this had a bizarre allergy to for loops. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-07 18:17:20 -04:00
J. Bruce Fields	4ada29d5c4	nfs: break up nfs_follow_referral This function is a little longer and more deeply nested than necessary. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-07 18:16:40 -04:00
EG Keizer	37ca8f5c60	nfs: authenticated deep mounting Allow mount to do authenticated mounts below the root of the exported tree. The wording in RFC 2623, sec 2.3.2. allows fsinfo with UNIX authentication on the root of the export. Mounts are not always done on the root of the exported tree. Especially autoumounts often mount below the root of the exported tree. Some server implementations (justly) require full authentication for the so-called deep mounts. The old code used AUTH_SYS only. This caused deep mounts to fail on systems requiring stronger authentication.. The client should try both authentication types and use the first one that succeeds. This method was already partially implemented. This patch completes the implementation for NFS2 and NFS3. This patch was developed to allow Debian systems to automount home directories on Solaris servers with krb5 authentication. Tested on kernel 2.6.24-etchnhalf.1 Signed-off-by: E.G. Keizer <keie@few.vu.nl> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-07 18:16:22 -04:00
Jeff Layton	f25b874d39	NFS: missing nfs_fattr_init in nfs3_proc_getacl and nfs3_proc_setacls (resend #2 ) The fattrs used in the NFSv3 getacl/setacl calls are not being properly initialized. This occasionally causes nfs_update_inode to fall into NFSv4 specific codepaths when handling post-op attrs from these calls. Thanks to Cai Qian for noticing the spurious NFSv4 messages in debug output from a v3 mount... Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-07 18:16:22 -04:00
J. Bruce Fields	f200c11c25	nfs: remove an obsolete nfs_flock comment We do now allow bsd flocks over nfs. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-07 18:16:21 -04:00
Denis V. Lunev	44d5759d3f	nfs: BUG_ON in nfs_follow_mountpoint Unfortunately, BUG_ON(IS_ROOT(dentry)) can happen inside nfs_follow_mountpoint with NFS running Fedora 8 using a specific setup. https://bugzilla.redhat.com/show_bug.cgi?id=458622 So, the situation should be handled on NFS client gracefully. Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Trond Myklebust <Trond.Myklebust@netapp.com> CC: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-07 18:15:16 -04:00
Denis V. Lunev	fd08d7e9d1	nfs: ERR_PTR is expected on failure from nfs_do_clone_mount Replace NULL with ERR_PTR(-EINVAL). Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-07 18:14:34 -04:00
Adrian Bunk	bb8a3b53c2	fix fs/nfs/nfsroot.c compilation This patch fixes the following compile error caused by commit `f9247273cb` (UFS: add const to parser token tabl): <-- snip --> ... CC fs/nfs/nfsroot.o /home/bunk/linux/kernel-2.6/git/linux-2.6/fs/nfs/nfsroot.c:130: error: tokens causes a section type conflict make[3]: *** [fs/nfs/nfsroot.o] Error 1 <-- snip --> Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-07 17:59:49 -04:00
Trond Myklebust	691beb13cd	NFS: Allow concurrent inode revalidation Currently, if two processes are both trying to revalidate metadata for the same inode, they will find themselves being serialised. There is no good justification for this now that we have improved our ability to detect stale attribute data, so we should remove that serialisation. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-07 17:59:43 -04:00
Trond Myklebust	2f28ea614f	NFS: Fix up nfs_setattr_update_inode() Ensure that it sets the inode metadata under the correct spinlock. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-07 17:41:46 -04:00
Trond Myklebust	076f1fc94c	NFS: Don't clear nfsi->cache_validity in nfs_check_inode_attributes() If we're merely checking the inode attributes because we suspect that the 'updated' attributes returned by the RPC call are stale, then we shouldn't be doing weak cache consistency updates or clearing the cache_validity flags. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-07 17:41:33 -04:00
Trond Myklebust	4dc05efb86	NFS: Convert __nfs_revalidate_inode() to use nfs_refresh_inode() In the case where there are parallel RPC calls to the same inode, we may receive stale metadata due to the lack of ordering, hence the sanity checking of metadata in nfs_refresh_inode(). Currently, __nfs_revalidate_inode() is calling nfs_update_inode() directly, without any further sanity checks, and hence may end up setting the inode up with stale metadata. Fix is to use nfs_refresh_inode() instead of nfs_update_inode(). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-07 17:41:17 -04:00
Trond Myklebust	d65f557f39	NFS: Fix nfs_post_op_update_inode_force_wcc() If we believe that the attributes are old (see nfs_refresh_inode()), then we shouldn't force an update. Also ensure that we hold the inode->i_lock across attribute checks and the call to nfs_refresh_inode_locked() to ensure that we don't race with other attribute updates. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-07 17:41:00 -04:00
Trond Myklebust	a10ad17630	NFS: Fix the NFS attribute update Currently nfs_refresh_inode() will only update the inode metadata if it sees that the RPC call that returned the nfs_fattr was started after the last update of the inode. This means that if we have parallel RPC calls to the same inode (when sending WRITE calls, for instance), we may often miss updates. This patch attempts to recover those missed updates by also accepting them if the ctime in the nfs_fattr is more recent than the inode's cached ctime. It also recovers the case where the file size has increased, but the ctime has not been updated due to limited ctime resolution. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-07 17:34:17 -04:00
Trond Myklebust	870a5be8b9	NFS: Clean up nfs_refresh_inode() and nfs_post_op_update_inode() Try to avoid taking and dropping the inode->i_lock more than once. Do so by moving the code in nfs_refresh_inode() that needs to be done under the spinlock into a function nfs_refresh_inode_locked(), and then having both nfs_refresh_inode() and nfs_post_op_update_inode() call it directly. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-07 17:29:49 -04:00
Trond Myklebust	7973c1f15a	NFS: Add mount options for controlling the lookup cache Add the following NFS-specific mount options to the parser. -o lookupcache=all /* Default: cache positive & negative dentries / -o lookupcache=pos[itive] / Don't cache negative dentries / -o lookupcache=none / Strict revalidation of all dentries */ Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-07 17:23:57 -04:00
Trond Myklebust	ff3525a539	NFS: Don't apply NFS_MOUNT_FLAGMASK to text-based mounts The point of introducing text-based mounts was to allow us to add functionality without having to worry about legacy binary mount formats. The mask should be there in order to ensure that binary formats don't start enabling features that they cannot support. There is no justification for applying it to the text mount path. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-07 17:23:56 -04:00
Trond Myklebust	4eec952e42	NFS: Add options for finer control of the lookup cache Add the flag NFS_MOUNT_LOOKUP_CACHE_NONEG to turn off the caching of negative dentries. In reality what we do is to force nfs_lookup_revalidate() to always discard negative dentries. Add the flag NFS_MOUNT_LOOKUP_CACHE_NONE for enforcing stricter revalidation of dentries. It forces the revalidate code to always do a lookup instead of just checking the cached mtime of the parent directory. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-07 17:22:20 -04:00
Steve French	0752f1522a	[CIFS] make sure we have the right resume info before calling CIFSFindNext When we do a seekdir() or equivalent, we usually end up doing a FindFirst call and then call FindNext until we get to the offset that we want. The problem is that when we call FindNext, the code usually doesn't have the proper info (mostly, the filename of the entry from the last search) to resume the search. Add a "last_entry" field to the cifs_search_info that points to the last entry in the search. We calculate this pointer by using the LastNameOffset field from the search parms that are returned. We then use that info to do a cifs_save_resume_key before we call CIFSFindNext. This patch allows CIFS to reliably pass the "telldir" connectathon test. Signed-off-by: Jeff Layton <jlayton@redhat.com> CC: Stable <stable@kernel.org> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-10-07 20:03:33 +00:00
Steve French	6050247d80	[CIFS] clean up error handling in cifs_unlink Currently, if a standard delete fails and we end up getting -EACCES we try to clear ATTR_READONLY and try the delete again. If that then fails with -ETXTBSY then we try a rename_pending_delete. We aren't handling other errors appropriately though. Another client could have deleted the file in the meantime and we get back -ENOENT, for instance. In that case we wouldn't do a d_drop. Instead of retrying in a separate call, simply goto the original call and use the error handling from that. Also, we weren't properly undoing any attribute changes that were done before returning an error back to the caller. CC: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-10-07 18:42:52 +00:00
Andi Kleen	39d80c33a0	ext4: Avoid double dirtying of super block in ext4_put_super() While reading code I noticed that ext4_put_super() dirties the superblock bh twice. It is always done in ext4_commit_super() too. Remove the redundant dirty operation. Should be a nop semantically. Signed-off-by: Andi Kleen <ak@linux.intel.com>	2008-10-06 21:37:44 -04:00
Eric Sandeen	6873fa0de1	Hook ext4 to the vfs fiemap interface. ext4_ext_walk_space() was reinstated to be used for iterating over file extents with a callback; it is used by the ext4 fiemap implementation. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-ext4@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org	2008-10-07 00:46:36 -04:00
Trond Myklebust	1daef0a868	NFS: Clean up nfs_sb_active/nfs_sb_deactive Instead of causing umount requests to block on server->active_wq while the asynchronous sillyrename deletes are executing, we can use the sb->s_active counter to obtain a reference to the super_block, and then release that reference in nfs_async_unlink_release(). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-06 20:08:26 -04:00
Trond Myklebust	d5e66348bb	NFS: Fix nfs_file_llseek() After the BKL removal patches were applied to the rest of the NFS code, the BKL protection in nfs_file_llseek() is no longer sufficient to ensure that inode->i_size is read safely in generic_file_llseek_unlocked(). In order to fix the situation, we either have to replace the naked read of inode->i_size in generic_file_llseek_unlocked() with i_size_read(), or the whole thing needs to be executed under the inode->i_lock; In order to avoid disrupting other filesystems, avoid touching generic_file_llseek_unlocked() for now... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-10-06 20:08:26 -04:00
Jeff Layton	6b37faa175	[CIFS] fix some settings of cifsAttrs after calling SetFileInfo and SetPathInfo We only need to set them when we call SetFileInfo or SetPathInfo directly, and as soon as possible after then. We had one place setting it where it didn't need to be, and another place where it was missing. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-10-06 21:54:41 +00:00
Chuck Lever	2937391385	NLM: Remove unused argument from svc_addsock() function Clean up: The svc_addsock() function no longer uses its "proto" argument, so remove it. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-10-04 17:12:27 -04:00
Chuck Lever	26a4140923	NLM: Remove "proto" argument from lockd_up() Clean up: Now that lockd_up() starts listeners for both transports, the "proto" argument is no longer needed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-10-04 17:12:27 -04:00
Chuck Lever	8c3916f4bd	NLM: Always start both UDP and TCP listeners Commit `24e36663`, which first appeared in 2.6.19, changed lockd so that the client side starts a UDP listener only if there is a UDP NFSv2/v3 mount. Its description notes: This... means that lockd will not listen on UDP if the only mounts are TCP mount (and nfsd hasn't started). The latter is the only one that concerns me at all - I don't know if this might be a problem with some servers. Unfortunately it is a problem for Linux itself. The rpc.statd daemon on Linux uses UDP for contacting the local lockd, no matter which protocol is used for NFS mounts. Without a local lockd UDP listener, NFSv2/v3 lock recovery from Linux NFS clients always fails. Revert parts of commit `24e36663` so lockd_up() always starts both listeners. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-10-04 17:08:16 -04:00
Josef Bacik	68c9d702bb	generic block based fiemap implementation Any block based fs (this patch includes ext3) just has to declare its own fiemap() function and then call this generic function with its own get_block_t. This works well for block based filesystems that will map multiple contiguous blocks at one time, but will work for filesystems that only map one block at a time, you will just end up with an "extent" for each block. One gotcha is this will not play nicely where there is hole+data after the EOF. This function will assume its hit the end of the data as soon as it hits a hole after the EOF, so if there is any data past that it will not pick that up. AFAIK no block based fs does this anyway, but its in the comments of the function anyway just in case. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org	2008-10-03 17:32:43 -04:00
Mark Fasheh	00dc417fa3	ocfs2: fiemap support Plug ocfs2 into ->fiemap. Some portions of ocfs2_get_clusters() had to be refactored so that the extent cache can be skipped in favor of going directly to the on-disk records. This makes it easier for us to determine which extent is the last one in the btree. Also, I'm not sure we want to be caching fiemap lookups anyway as they're not directly related to data read/write. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: ocfs2-devel@oss.oracle.com Cc: linux-fsdevel@vger.kernel.org	2008-10-03 17:32:11 -04:00
Mark Fasheh	c4b929b85b	vfs: vfs-level fiemap interface Basic vfs-level fiemap infrastructure, which sets up a new ->fiemap inode operation. Userspace can get extent information on a file via fiemap ioctl. As input, the fiemap ioctl takes a struct fiemap which includes an array of struct fiemap_extent (fm_extents). Size of the extent array is passed as fm_extent_count and number of extents returned will be written into fm_mapped_extents. Offset and length fields on the fiemap structure (fm_start, fm_length) describe a logical range which will be searched for extents. All extents returned will at least partially contain this range. The actual extent offsets and ranges returned will be unmodified from their offset and range on-disk. The fiemap ioctl returns '0' on success. On error, -1 is returned and errno is set. If errno is equal to EBADR, then fm_flags will contain those flags which were passed in which the kernel did not understand. On all other errors, the contents of fm_extents is undefined. As fiemap evolved, there have been many authors of the vfs patch. As far as I can tell, the list includes: Kalpak Shah <kalpak.shah@sun.com> Andreas Dilger <adilger@sun.com> Eric Sandeen <sandeen@redhat.com> Mark Fasheh <mfasheh@suse.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: linux-api@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org	2008-10-08 19:44:18 -04:00
Kalpak Shah	4d20c685fa	ext4: fix xattr deadlock ext4_xattr_set_handle() eventually ends up calling ext4_mark_inode_dirty() which tries to expand the inode by shifting the EAs. This leads to the xattr_sem being downed again and leading to a deadlock. This patch makes sure that if ext4_xattr_set_handle() is in the call-chain, ext4_mark_inode_dirty() will not expand the inode. Signed-off-by: Kalpak Shah <kalpak.shah@sun.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-08 23:21:54 -04:00
Theodore Ts'o	45a90bfd90	jbd2: Fix buffer head leak when writing the commit block Also make sure the buffer heads are marked clean before submitting bh for writing. The previous code was marking the buffer head dirty, which would have forced an unneeded write (and seek) to the journal for no good reason. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-06 12:04:02 -04:00
Theodore Ts'o	ede86cc473	ext4: Add debugging markers that can be used by systemtap This debugging markers are designed to debug problems such as the random filesystem latency problems reported by Arjan. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-05 20:50:06 -04:00
Duane Griffin	23f8b79eae	jbd2: abort instead of waiting for nonexistent transaction The __jbd2_log_wait_for_space function sits in a loop checkpointing transactions until there is sufficient space free in the journal. However, if there are no transactions to be processed (e.g. because the free space calculation is wrong due to a corrupted filesystem) it will never progress. Check for space being required when no transactions are outstanding and abort the journal instead of endlessly looping. This patch fixes the bug reported by Sami Liedes at: http://bugzilla.kernel.org/show_bug.cgi?id=10976 Signed-off-by: Duane Griffin <duaneg@dghda.com> Cc: Sami Liedes <sliedes@cc.hut.fi> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-08 23:28:31 -04:00
Frederic Bohe	c806e68f56	ext4: fix initialization of UNINIT bitmap blocks This fixes a bug which caused on-line resizing of filesystems with a 1k blocksize to fail. The root cause of this bug was the fact that if an uninitalized bitmap block gets read in by userspace (which e2fsprogs does try to avoid, but can happen when the blocksize is less than the pagesize and an adjacent blocks is read into memory) ext4_read_block_bitmap() was erroneously depending on the buffer uptodate flag to decide whether it needed to initialize the bitmap block in memory --- i.e., to set the standard set of blocks in use by a block group (superblock, bitmaps, inode table, etc.). Essentially, ext4_read_block_bitmap() assumed it was the only routine that might try to read a block containing a block bitmap, which is simply not true. To fix this, ext4_read_block_bitmap() and ext4_read_inode_bitmap() must always initialize uninitialized bitmap blocks. Once a block or inode is allocated out of that bitmap, it will be marked as initialized in the block group descriptor, so in general this won't result any extra unnecessary work. Signed-off-by: Frederic Bohe <frederic.bohe@bull.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-10 08:09:18 -04:00
Theodore Ts'o	c2ea3fde61	ext4: Remove old legacy block allocator Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-10 09:40:52 -04:00
Theodore Ts'o	240799cdf2	ext4: Use readahead when reading an inode from the inode table With modern hard drives, reading 64k takes roughly the same time as reading a 4k block. So request readahead for adjacent inode table blocks to reduce the time it takes when iterating over directories (especially when doing this in htree sort order) in a cold cache case. With this patch, the time it takes to run "git status" on a kernel tree after flushing the caches via "echo 3 > /proc/sys/vm/drop_caches" is reduced by 21%. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-09 23:53:47 -04:00
Chuck Lever	9a38a83880	lockd: Remove unused fields in the nlm_reboot structure The nlm_reboot structure is used to store information provided by the NSM_NOTIFY procedure. This procedure is not specified by the NLM or NSM protocols, other than to say that the procedure can be used to transmit information private to a particular NLM/NSM implementation. For Linux, the callback arguments include the name of the monitored host, the new NSM state of the host, and a 16-byte private opaque. As a clean up, remove the unused fields and the server-side XDR logic that decodes them. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-10-03 17:02:35 -04:00
Chuck Lever	b85e467634	lockd: Add helper to sanity check incoming NOTIFY requests lockd accepts SM_NOTIFY calls only from a privileged process on the local system. If lockd uses an AF_INET6 listener, the sender's address (ie the local rpc.statd) will be the IPv6 loopback address, not the IPv4 loopback address. Make sure the privilege test in nlmsvc_proc_sm_notify() and nlm4svc_proc_sm_notify() works for both AF_INET and AF_INET6 family addresses by refactoring the test into a helper and adding support for IPv6 addresses. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-10-03 17:02:35 -04:00
Chuck Lever	dcff09f124	lockd: change nlmclnt_grant() to take a "struct sockaddr " Adjust the signature and callers of nlmclnt_grant() to pass a "struct sockaddr " instead of a "struct sockaddr_in *" in order to support IPv6 addresses. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-10-03 17:02:35 -04:00
Chuck Lever	6bfbe8af46	lockd: Adjust nlmsvc_lookup_host() to accomodate AF_INET6 addresses Fix up nlmsvc_lookup_host() to pass AF_INET6 source addresses to nlm_lookup_host(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-10-03 17:02:35 -04:00
Chuck Lever	d7d204403b	lockd: Adjust nlmclnt_lookup_host() signature to accomodate non-AF_INET Pass a struct sockaddr * and a length to nlmclnt_lookup_host() to accomodate non-AF_INET family addresses. As a side benefit, eliminate the hostname_len argument, as the hostname is always NUL-terminated. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-10-03 17:02:34 -04:00
Chuck Lever	88541c8487	lockd: Support non-AF_INET addresses in nlm_lookup_host() Use struct sockaddr * and length in nlm_lookup_host_info to all callers to pass in either AF_INET or AF_INET6 addresses. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-10-03 17:01:57 -04:00
Chuck Lever	7f1ed18bd3	NLM: Convert nlm_lookup_host() to use a single argument The nlm_lookup_host() function already has a large number of arguments, and I'm about to add a few more. As a clean up, convert the function to use a single data structure argument. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-10-03 16:58:23 -04:00
J. Bruce Fields	d22b1cff09	lockd: reject reclaims outside the grace period The current lockd does not reject reclaims that arrive outside of the grace period. Accepting a reclaim means promising to the client that no conflicting locks were granted since last it held the lock. We can meet that promise if we assume the only lockers are nfs clients, and that they are sufficiently well-behaved to reclaim only locks that they held before, and that only reclaim locks have been permitted so far. Once we leave the grace period (and start permitting non-reclaims), we can no longer keep that promise. So we must start rejecting reclaims at that point. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-10-03 16:19:20 -04:00
J. Bruce Fields	b2b5028905	lockd: move grace period checks to common code Do all the grace period checks in svclock.c. This simplifies the code a bit, and will ease some later changes. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-10-03 16:19:19 -04:00
J. Bruce Fields	af558e33be	nfsd: common grace period control Rewrite grace period code to unify management of grace period across lockd and nfsd. The current code has lockd and nfsd cooperate to compute a grace period which is satisfactory to them both, and then individually enforce it. This creates a slight race condition, since the enforcement is not coordinated. It's also more complicated than necessary. Here instead we have lockd and nfsd each inform common code when they enter the grace period, and when they're ready to leave the grace period, and allow normal locking only after both of them are ready to leave. We also expect the locks_start_grace()/locks_end_grace() interface here to be simpler to build on for future cluster/high-availability work, which may require (for example) putting individual filesystems into grace, or enforcing grace periods across multiple cluster nodes. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-10-03 16:19:02 -04:00
Nick Piggin	4b19de6d1c	mm: tiny-shmem nommu fix The previous patch `db203d53d4` ("mm: tiny-shmem fix lock ordering: mmap_sem vs i_mutex") to fix the lock ordering in tiny-shmem breaks shared anonymous and IPC memory on NOMMU architectures because it was using the expanding truncate to signal ramfs to allocate a physically contiguous RAM backing the inode (otherwise it is unusable for "memory mapping" it to userspace). However do_truncate is what caused the lock ordering error, due to it taking i_mutex. In this case, we can actually just call ramfs directly to allocate memory for the mapping, rather than go via truncate. Acked-by: David Howells <dhowells@redhat.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-02 15:53:13 -07:00
Nick Piggin	16dbc6c961	inotify: fix lock ordering wrt do_page_fault's mmap_sem Fix inotify lock order reversal with mmap_sem due to holding locks over copy_to_user. Signed-off-by: Nick Piggin <npiggin@suse.de> Reported-by: "Daniel J Blueman" <daniel.blueman@gmail.com> Tested-by: "Daniel J Blueman" <daniel.blueman@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-02 15:53:13 -07:00
Adrian Hunter	be2f6bd62d	UBIFS: check buffer length when scanning for LPT nodes 'is_a_node()' function was reading from a buffer before checking the buffer length, resulting in an OOPS as follows: BUG: unable to handle kernel paging request at f8f74002 IP: [<f8f9783f>] :ubifs:ubifs_unpack_bits+0xca/0x233 pde = 19e95067 pte = 00000000 Oops: 0000 [#1] PREEMPT SMP Modules linked in: ubifs ubi mtdchar bio2mtd mtd brd video output [last unloaded: mtd] Pid: 6414, comm: integck Not tainted (2.6.27-rc6ubifs34 #23) EIP: 0060:[<f8f9783f>] EFLAGS: 00010246 CPU: 0 EIP is at ubifs_unpack_bits+0xca/0x233 [ubifs] EAX: 00000000 EBX: f6090630 ECX: d9badcfc EDX: 00000000 ESI: 00000004 EDI: f8f74002 EBP: d9badcec ESP: d9badcc0 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Process integck (pid: 6414, ti=d9bac000 task=f727dae0 task.ti=d9bac000) Stack: 00000006 f7306240 00000002 00000000 d9badcfc d9badd00 0000001c 00000000 f6090630 f6090630 f8f74000 d9badd10 f8fa1cc9 00000000 f8f74002 00000000 f8f74002 f60fe128 f6090630 f8f74000 d9badd68 f8fa1e46 00000000 0001e000 Call Trace: [<f8fa1cc9>] ? is_a_node+0x30/0x90 [ubifs] [<f8fa1e46>] ? dbg_check_ltab+0x11d/0x5bd [ubifs] [<f8fa388f>] ? ubifs_lpt_start_commit+0x42/0xed3 [ubifs] [<c038e76a>] ? mutex_unlock+0x8/0xa [<f8f9625d>] ? ubifs_tnc_start_commit+0x1c8/0xedb [ubifs] [<f8f8d90b>] ? do_commit+0x187/0x523 [ubifs] [<c038e76a>] ? mutex_unlock+0x8/0xa [<f8f7ca17>] ? bud_wbuf_callback+0x22/0x28 [ubifs] [<f8f8dd1d>] ? ubifs_run_commit+0x76/0xc0 [ubifs] [<f8f8032c>] ? ubifs_sync_fs+0xd2/0xe6 [ubifs] [<c01a2e97>] ? vfs_quota_sync+0x0/0x17e [<c01a5ba6>] ? quota_sync_sb+0x26/0xbb [<c01a2e97>] ? vfs_quota_sync+0x0/0x17e [<c01a5c5d>] ? sync_dquots+0x22/0x12c [<c0173d1b>] ? __fsync_super+0x19/0x68 [<c0173d75>] ? fsync_super+0xb/0x19 [<c0174065>] ? generic_shutdown_super+0x22/0xe7 [<c01a31fc>] ? vfs_quota_off+0x0/0x5fd [<f8f7cf4d>] ? ubifs_kill_sb+0x31/0x35 [ubifs] [<c01741f9>] ? deactivate_super+0x5e/0x71 [<c0187610>] ? mntput_no_expire+0x82/0xe4 [<c0187905>] ? sys_umount+0x4c/0x2f6 [<c0187bc8>] ? sys_oldumount+0x19/0x1b [<c0103b71>] ? sysenter_do_call+0x12/0x25 ======================= Code: c1 f8 03 8d 04 07 8b 4d e8 89 01 8b 45 e4 89 10 89 d8 89 f1 d3 e8 85 c0 74 07 29 d6 83 fe 20 75 2a 89 d8 83 c4 20 5b 5e 5f 5d EIP: [<f8f9783f>] ubifs_unpack_bits+0xca/0x233 [ubifs] SS:ESP 0068:d9badcc0 ---[ end trace 1f02572436518c13 ]--- Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>	2008-09-30 11:12:59 +03:00
Adrian Hunter	63c300b68f	UBIFS: correct condition to eliminate unecessary assignment Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>	2008-09-30 11:12:59 +03:00
Adrian Hunter	73944a6de0	UBIFS: add more debugging messages for LPT Also add debugging checks for LPT size and separate out c->check_lpt_free from unrelated bitfields. Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>	2008-09-30 11:12:59 +03:00
Adrian Hunter	5c0013c16b	UBIFS: fix bulk-read handling uptodate pages Bulk-read skips uptodate pages but this was putting its array index out and causing it to treat subsequent pages as holes. Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>	2008-09-30 11:12:59 +03:00
Adrian Hunter	46773be497	UBIFS: improve garbage collection Make garbage collection try to keep data nodes from the same inode together and in ascending order. This improves performance when reading those nodes especially when bulk-read is used. Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>	2008-09-30 11:12:59 +03:00
Adrian Hunter	bed79935de	UBIFS: allow for sync_fs when read-only sync_fs can be called even if the file system is mounted read-only. Ensure the commit is not run in that case. Reported-by: Zoltan Sogor <weth@inf.u-szeged.hu> Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>	2008-09-30 11:12:58 +03:00
Artem Bityutskiy	403e12ab30	UBIFS: commit on sync_fs Commit the journal when the FS is sync'ed. This will make statfs provide better free space report. And we anyway advice our users to sync the FS if they want better statfs report. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2008-09-30 11:12:58 +03:00
Artem Bityutskiy	af2eb5637b	UBIFS: correct comment for commit_on_unmount Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2008-09-30 11:12:58 +03:00
Artem Bityutskiy	b5e426e9a4	UBIFS: update dbg_dump_inode 'dbg_dump_inode()' is quite outdated and does not print all the fileds. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2008-09-30 11:12:58 +03:00
Artem Bityutskiy	be61678b1d	UBIFS: fix commentary Znode may refer both data nodes and indexing nodes Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2008-09-30 11:12:58 +03:00
Artem Bityutskiy	ba60ecabf0	UBIFS: fix races in bit-fields We cannot store bit-fields together if the processes which change them may race, unless we serialize them. Thus, move the nospc and nospc_rp bit-fields eway from the mount option/constant bit-fields, to avoid races. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2008-09-30 11:12:58 +03:00
Adrian Hunter	ed382d5898	UBIFS: ensure data read beyond i_size is zeroed out correctly Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>	2008-09-30 11:12:57 +03:00
Adrian Hunter	2094c334fd	UBIFS: correct key comparison The comparison was working, but more by accident than design. Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>	2008-09-30 11:12:57 +03:00
Artem Bityutskiy	625bf371c1	UBIFS: use bit-fields when possible The "bulk_read" and "no_chk_data_crc" have only 2 values - 0 and 1. We already have bit-fields in corresponding data structers, so make "bulk_read" and "no_chk_data_crc" bit-fields as well. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2008-09-30 11:12:57 +03:00
Artem Bityutskiy	ccb3eba724	UBIFS: check data CRC when in error state When UBIFS switches to R/O mode because of an error, it is reasonable to enable data CRC checking. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2008-09-30 11:12:57 +03:00
Adrian Hunter	2242c689ec	UBIFS: improve znode splitting rules When inserting into a full znode it is split into two znodes. Because data node keys are usually consecutive, it is better to try to keep them together. This patch does a better job of that. Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>	2008-09-30 11:12:57 +03:00
Adrian Hunter	2953e73f1c	UBIFS: add no_chk_data_crc mount option UBIFS read performance can be improved by skipping the CRC check when data nodes are read. This option can be used if the underlying media is considered to be highly reliable. Note that CRCs are always checked for metadata. Read speed on Arm platform with OneNAND goes from 19 MiB/s to 27 MiB/s with data CRC checking disabled. Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>	2008-09-30 11:12:56 +03:00
Adrian Hunter	4793e7c5e1	UBIFS: add bulk-read facility Some flash media are capable of reading sequentially at faster rates. UBIFS bulk-read facility is designed to take advantage of that, by reading in one go consecutive data nodes that are also located consecutively in the same LEB. Read speed on Arm platform with OneNAND goes from 17 MiB/s to 19 MiB/s. Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>	2008-09-30 11:12:56 +03:00
Julien Brunel	a70948b564	UBIFS: use an IS_ERR test rather than a NULL test In case of error, the function kthread_create returns an ERR pointer, but never returns a NULL pointer. So a NULL test that comes before an IS_ERR test should be deleted. The semantic match that finds this problem is as follows: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @match_bad_null_test@ expression x, E; statement S1,S2; @@ x = kthread_create(...) ... when != x = E * if (x == NULL) S1 else S2 // </smpl> Signed-off-by: Julien Brunel <brunel@diku.dk> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2008-09-30 11:12:56 +03:00
Artem Bityutskiy	746103aca2	UBIFS: inline one-line functions 'ubifs_get_lprops()' and 'ubifs_release_lprops()' basically wrap mutex lock and unlock. We have them because we want lprops subsystem be separate and as independent as possible. And we planned better locking rules for lprops. Anyway, because they are short, it is better to inline them. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2008-09-30 11:12:56 +03:00
Hirofumi Nakagawa	8d47aef43b	UBIFS: remove unneeded unlikely() IS_ERR() macro already has unlikely(), so do not use constructions like 'if (unlikely(IS_ERR())'. Signed-off-by: Hirofumi Nakagawa <hnakagawa@miraclelinux.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2008-09-30 11:12:55 +03:00
Artem Bityutskiy	948cfb219b	UBIFS: add a print, fix comments and more minor stuff This commit adds a reserved pool size print and tweaks the prints to make them look nicer. It also fixes and cleans-up some comments. Additionally, it deletes some blank lines to make the code look a little nicer. In other words, nothing essential. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2008-09-30 11:12:55 +03:00
Benny Halevy	d5b337b487	nfsd: use nfs client rpc callback program since commit `ff7d9756b5` "nfsd: use static memory for callback program and stats" do_probe_callback uses a static callback program (NFS4_CALLBACK) rather than the one set in clp->cl_callback.cb_prog as passed in by the client in setclientid (4.0) or create_session (4.1). This patches introduces rpc_create_args.prognumber that allows overriding program->number when creating rpc_clnt. Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-09-29 18:13:40 -04:00
Benny Halevy	97eb89bb0e	nfsd: do_probe_callback should not clear rpc stats Now that cb_stats are static (since commit `ff7d9756b5`) there's no need to clear them. Initially I thought it might make sense to do that every callback probing but since the stats are per-program and they are shared between possibly several client callback instances, zeroing them out seems like the wrong thing to do. Note that that commit also introduced a bug since stats.program is also being cleared in the process and it is not restored after the memset as it used to be. Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-09-29 18:13:40 -04:00
Chuck Lever	e018040a82	lockd: Update nsm_find() to support non-AF_INET addresses Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-09-29 18:13:39 -04:00
Chuck Lever	bc48e4d637	lockd: Combine __nsm_find() and nsm_find(). Clean up: Having two separate functions doesn't add clarity, so eliminate one of them. Use contemporary kernel coding conventions where appropriate. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-09-29 18:13:39 -04:00
Chuck Lever	ede2fea099	lockd: Support AF_INET6 when hashing addresses in nlm_lookup_host Adopt an approach similar to the RPC server's auth cache (from Aurelien Charbon and Brian Haley). Note nlm_lookup_host()'s existing IP address hash function has the same issue with correctness on little-endian systems as the original IPv4 auth cache hash function, so I've also updated it with a hash function similar to the new auth cache hash function. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-09-29 18:13:39 -04:00
Chuck Lever	781b61a6f4	lockd: Teach nlm_cmp_addr() to support AF_INET6 addresses Update the nlm_cmp_addr() helper to support AF_INET6 as well as AF_INET addresses. New version takes two "struct sockaddr " arguments instead of "struct sockaddr_in " arguments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-09-29 18:13:39 -04:00
Chuck Lever	7e9d7746bf	NSM: Use sockaddr_storage for sm_addr field To store larger addresses in the nsm_handle structure, make sm_addr a sockaddr_storage. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-09-29 18:13:39 -04:00
Chuck Lever	90151e6e4d	lockd: Use sockaddr_storage for h_saddr field To store larger addresses in the nlm_host structure, make h_saddr a sockaddr_storage. And let's call it something more self-explanatory: "saddr" could easily be mistaken for "server address". Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-09-29 18:13:39 -04:00
Chuck Lever	b4ed58fd34	lockd: Use sockaddr_storage + length for h_addr field To store larger addresses in the nlm_host structure, make h_addr a sockaddr_storage, and add an address length field. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-09-29 18:13:39 -04:00
Chuck Lever	396cb3d003	lockd: Add address family-agnostic helper for zeroing the port number Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-09-29 18:13:38 -04:00
Chuck Lever	2860a0227b	lockd: Specify address family for source address Make sure an address family is specified for source addresses passed to nlm_lookup_host(). nlm_lookup_host() will need this when it becomes capable of dealing with AF_INET6 addresses. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-09-29 18:13:38 -04:00
Chuck Lever	1b333c54a1	lockd: address-family independent printable addresses Knowing which source address is used for communicating with remote NLM services can be helpful for debugging configuration problems on hosts with multiple addresses. Keep the dprintk debugging here, but adapt it so it displays AF_INET6 addresses properly. There are also a couple of dprintk clean-ups as well. At some point we will aggregate the helpers that display presentation format addresses into a single set of shared helpers. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-09-29 18:13:38 -04:00
Chuck Lever	c2526f4271	NLM: Clean up before introducing new debugging messages We're about to introduce some extra debugging messages in nlm_lookup_host(). Bring the coding style up to date first so we can cleanly introduce the new debugging messages. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-09-29 18:13:38 -04:00
Chuck Lever	a26cfad6e0	SUNRPC: Support IPv6 when registering kernel RPC services In order to advertise NFS-related services on IPv6 interfaces via rpcbind, the kernel RPC server implementation must use rpcb_v4_register() instead of rpcb_register(). A new kernel build option allows distributions to use the legacy v2 call until they integrate an appropriate user-space rpcbind daemon that can support IPv6 RPC services. I tried adding some automatic logic to fall back if registering with a v4 protocol request failed, but there are too many corner cases. So I just made it a compile-time switch that distributions can throw when they've replaced portmapper with rpcbind. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-09-29 18:13:38 -04:00
J. Bruce Fields	c8ab5f2a13	lockd: don't depend on lockd main loop to end grace End lockd's grace period using schedule_delayed_work() instead of a check on every pass through the main loop. After a later patch, we'll depend on lockd to end its grace period even if it's not currently handling requests; so it shouldn't depend on being woken up from the main loop to do so. Also, Nakano Hiroaki (who independently produced a similar patch) noticed that the current behavior is buggy in the face of jiffies wraparound: "lockd uses time_before() to determine whether the grace period has expired. This would seem to be enough to avoid timer wrap-around issues, but, unfortunately, that is not the case. The time_* family of comparison functions can be safely used to compare jiffies relatively close in time, but they stop working after approximately LONG_MAX/2 ticks. nfsd can suffer this problem because the time_before() comparison in lockd() is not performed until the first request comes in, which means that if there is no lockd traffic for more than LONG_MAX/2 ticks we are screwed. "The implication of this is that once time_before() starts misbehaving any attempt from a NFS client to execute fcntl() will be received with a NLM_LCK_DENIED_GRACE_PERIOD message for 25 days (assuming HZ=1000). In other words, the 50 seconds grace period could turn into a grace period of 50 days or more. "Note: This bug was analyzed independently by Oda-san <oda@valinux.co.jp> and myself." Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Cc: Nakano Hiroaki <nakano.hiroaki@oss.ntt.co.jp> Cc: Itsuro Oda <oda@valinux.co.jp>	2008-09-29 18:13:10 -04:00
J. Bruce Fields	8fafa90082	locks: allow lockd to process blocked locks during grace period The check here is currently harmless but unnecessary, since, as the comment notes, there aren't any blocked-lock callbacks to process during the grace period anyway. And eventually we want to allow multiple grace periods that come and go for different filesystems over the course of the lifetime of lockd, at which point this check is just going to get in the way. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-09-29 17:56:59 -04:00
Jeff Layton	54a66e5480	knfsd: allocate readahead cache in individual chunks I had a report from someone building a large NFS server that they were unable to start more than 585 nfsd threads. It was reported against an older kernel using the slab allocator, and I tracked it down to the large allocation in nfsd_racache_init failing. It appears that the slub allocator handles large allocations better, but large contiguous allocations can often be problematic. There doesn't seem to be any reason that the racache has to be allocated as a single large chunk. This patch breaks this up so that the racache is built up from separate allocations. (Thanks also to Takashi Iwai for a bugfix.) Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Cc: Takashi Iwai <tiwai@suse.de>	2008-09-29 17:56:59 -04:00
Benny Halevy	e31a1b662f	nfsd: nfs4xdr decode_stateid helper function Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-09-29 17:56:59 -04:00

... 5 6 7 8 9 ...

10727 Commits