linux-sg2042

Commit Graph

Author	SHA1	Message	Date
Oleg Nesterov	8cd9c24912	coredump: simplify core_state->nr_threads calculation Change zap_process() to return int instead of incrementing mm->core_state->nr_threads directly. Change zap_threads() to set mm->core_state only on success. This patch restores the original size of .text, and more importantly now ->nr_threads is used in two places only. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:39 -07:00
Oleg Nesterov	999d9fc167	coredump: move mm->core_waiters into struct core_state Move mm->core_waiters into "struct core_state" allocated on stack. This shrinks mm_struct a little bit and allows further changes. This patch mostly does s/core_waiters/core_state. The only essential change is that coredump_wait() must clear mm->core_state before return. The coredump_wait()'s path is uglified and .text grows by 30 bytes, this is fixed by the next patch. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:39 -07:00
Oleg Nesterov	32ecb1f26d	coredump: turn mm->core_startup_done into the pointer to struct core_state mm->core_startup_done points to "struct completion startup_done" allocated on the coredump_wait()'s stack. Introduce the new structure, core_state, which holds this "struct completion". This way we can add more info visible to the threads participating in coredump without enlarging mm_struct. No changes in affected .o files. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:39 -07:00
Oleg Nesterov	24d5288f06	coredump: elf_core_dump: skip kernel threads linux_binfmt->core_dump() runs before the process does exit_aio(), this means that we can hit the kernel thread which shares the same ->mm. Afaics, nothing really bad can happen, but perhaps it makes sense to fix this minor bug. It is sad we have to iterate over all threads in system and use GFP_ATOMIC. Hopefully we can kill theses ugly do_each_thread()s, but this needs some nontrivial changes in mm_struct and do_coredump. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:39 -07:00
Oleg Nesterov	15b9f360c0	coredump: zap_threads() must skip kernel threads The main loop in zap_threads() must skip kthreads which may use the same mm. Otherwise we "kill" this thread erroneously (for example, it can not fork or exec after that), and the coredumping task stucks in the TASK_UNINTERRUPTIBLE state forever because of the wrong ->core_waiters count. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:39 -07:00
Oleg Nesterov	246bb0b1de	kill PF_BORROWED_MM in favour of PF_KTHREAD Kill PF_BORROWED_MM. Change use_mm/unuse_mm to not play with ->flags, and do s/PF_BORROWED_MM/PF_KTHREAD/ for a couple of other users. No functional changes yet. But this allows us to do further fixes/cleanups. oom_kill/ptrace/etc often check "p->mm != NULL" to filter out the kthreads, this is wrong because of use_mm(). The problem with PF_BORROWED_MM is that we need task_lock() to avoid races. With this patch we can check PF_KTHREAD directly, or use a simple lockless helper: /* The result must not be dereferenced !!! / struct mm_struct __get_task_mm(struct task_struct *tsk) { if (tsk->flags & PF_KTHREAD) return NULL; return tsk->mm; } Note also ecard_task(). It runs with ->mm != NULL, but it's the kernel thread without PF_BORROWED_MM. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:39 -07:00
Oleg Nesterov	7b34e4283c	introduce PF_KTHREAD flag Introduce the new PF_KTHREAD flag to mark the kernel threads. It is set by INIT_TASK() and copied to the forked childs (we could set it in kthreadd() along with PF_NOFREEZE instead). daemonize() was changed as well. In that case testing of PF_KTHREAD is racy, but daemonize() is hopeless anyway. This flag is cleared in do_execve(), before search_binary_handler(). Probably not the best place, we can do this in exec_mmap() or in start_thread(), or clear it along with PF_FORKNOEXEC. But I think this doesn't matter in practice, and if do_execve() fails kthread should die soon. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:39 -07:00
Oleg Nesterov	e4901f92a8	coredump: zap_threads: comments && use while_each_thread() No changes in fs/exec.o The for_each_process() loop in zap_threads() is very subtle, it is not clear why we don't race with fork/exit/exec. Add the fat comment. Also, change the code to use while_each_thread(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:38 -07:00
Jan Kara	657d3bfa98	quota: implement sending information via netlink about user below quota Sometimes it may be useful for userspace to know (e.g. for some hosting guys) that some user stopped exceeding his hardlimit or softlimit in quotas. Implement sending of such events to userspace via quota netlink protocol so that they don't have to poll for such events. Based on idea and initial implementation by Vladislav Bogdanov. Cc: Vladislav Bogdanov <slava@nsys.by> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:35 -07:00
Jan Kara	74abb9890d	quota: move function-macros from quota.h to quotaops.h Move declarations of some macros, which should be in fact functions to quotaops.h. This way they can be later converted to inline functions because we can now use declarations from quota.h. Also add necessary includes of quotaops.h to a few files. [akpm@linux-foundation.org: fix JFS build] [akpm@linux-foundation.org: fix UFS build] [vegard.nossum@gmail.com: fix QUOTA=n build] Signed-off-by: Jan Kara <jack@suse.cz> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Arjen Pool <arjenpool@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:35 -07:00
Jan Kara	02a55ca871	quota: cleanup loop in sync_dquots() Make loop in sync_dquots() checking whether there's something to write more readable, remove useless variable and macro info_any_dirty() which is used only in this place. Signed-off-by: Jan Kara <jack@suse.cz> Cc: "Vegard Nossum" <vegard.nossum@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:35 -07:00
Jan Kara	b85f4b87a5	quota: rename quota functions from upper case, make bigger ones non-inline Cleanup quotaops.h: Rename functions from uppercase to lowercase (and define backward compatibility macros), move larger functions to dquot.c and make them non-inline. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:35 -07:00
Jan Kara	b48d380541	quota: fix possible infinite loop in quota code When quota structure is going to be dropped and it is dirty, quota code tries to write it. If the write fails for some reason (e. g. transaction cannot be started because the journal is aborted), we try writing again and again and again... Fix the problem by clearing the dirty bit even if the write failed. (akpm: for 2.6.27, 2.6.26.x and 2.6.25.x) Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: dingdinghua <dingdinghua85@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:34 -07:00
Joe Peterson	b271e067c8	fatfs: add UTC timestamp option Provide a new mount option ("tz=UTC") for DOS (vfat/msdos) filesystems, allowing timestamps to be in coordinated universal time (UTC) rather than local time in applications where doing this is advantageous. In particular, portable devices that use fat/vfat (such as digital cameras) can benefit from using UTC in their internal clocks, thus avoiding daylight saving time errors and general time ambiguity issues. The user of the device does not have to worry about changing the time when moving from place or when daylight saving changes. The new mount option, when set, disables the counter-adjustment that Linux currently makes to FAT timestamp info in anticipation of the normal userspace time zone correction. When used in this new mode, all daylight saving time and time zone handling is done in userspace as is normal for many other filesystems (like ext3). The default mode, which remains unchanged, is still appropriate when mounting volumes written in Windows (because of its use of local time). I originally based this patch on one submitted last year by Paul Collins, but I updated it to work with current source and changed variable/option naming. Ogawa Hirofumi (who maintains these filesystems) and I discussed this patch at length on lkml, and he suggested using the option name in the attached version of the patch. Barry Bouwsma pointed out a good addition to the patch as well. Signed-off-by: Joe Peterson <joe@skyrush.com> Signed-off-by: Paul Collins <paul@ondioline.org> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Barry Bouwsma <free_beer_for_all@yahoo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:34 -07:00
Adrian Bunk	e8938a62a8	remove unused #include <linux/dirent.h>'s Remove some unused #include <linux/dirent.h>'s. Signed-off-by: Adrian Bunk <bunk@kernel.org> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:34 -07:00
Rene Scharfe	7557bc66be	msdos fs: remove unsettable atari option It has been impossible to set the option 'atari' of the MSDOS filesystem for several years. Since nobody seems to have missed it, let's remove its remains. Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:34 -07:00
OGAWA Hirofumi	dcd8c53f13	fat: small optimization to __fat_readdir() This removes unnecessary parsing for directory entries. If short_only, we don't need to parse longname. And if !both and it found the longname, we don't need shortname. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:34 -07:00
OGAWA Hirofumi	98a1516004	fat: use same logic in fat_search_long() and __fat_readdir() This uses uses stack for shortname, and uses __getname() for longname in fat_search_long() and __fat_readdir(). By this, it removes unneeded __getname() for shortname. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:34 -07:00
OGAWA Hirofumi	d688611674	fat: cleanup fs/fat/dir.c This is no logic changes, just cleans fs/fat/dir.c up. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:34 -07:00
Adrian Bunk	531f710f8e	fat/dir.c: switch to struct __fat_dirent struct __fat_dirent is what was formerly the kernel struct dirent (that was different from the userspace struct dirent). Converting all fat users to struct __fat_dirent will allow us to get rid of the conflicting struct dirent definition. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:34 -07:00
OGAWA Hirofumi	8d44d9741f	fat: fix parse_options() Current parse_options() exits too early. We need to run the code of bottom in this function even if users doesn't specify options. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:34 -07:00
Shen Feng	3264d4ded4	reiserfs: remove double definitions of xattr macros remove the definitions of macros: XATTR_SECURITY_PREFIX XATTR_TRUSTED_PREFIX XATTR_USER_PREFIX since they are defined in linux/xattr.h Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:33 -07:00
Jeff Mahoney	90415deac7	reiserfs: convert j_commit_lock to mutex j_commit_lock is a semaphore but uses it as if it were a mutex. This patch converts it to a mutex. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Jeff Mahoney <jeffm@suse.com> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Chris Mason <chris.mason@oracle.com> Cc: Edward Shishkin <edward.shishkin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:33 -07:00
Jeff Mahoney	afe7025907	reiserfs: convert j_flush_sem to mutex j_flush_sem is a semaphore but uses it as if it were a mutex. This patch converts it to a mutex. [akpm@linux-foundation.org: fix mutex_trylock retval treatment] Signed-off-by: Jeff Mahoney <jeffm@suse.com> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Chris Mason <chris.mason@oracle.com> Cc: Edward Shishkin <edward.shishkin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:33 -07:00
Jeff Mahoney	f68215c464	reiserfs: convert j_lock to mutex j_lock is a semaphore but uses it as if it were a mutex. This patch converts it to a mutex. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Chris Mason <chris.mason@oracle.com> Cc: Edward Shishkin <edward.shishkin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:33 -07:00
Jan Kara	00b441970a	reiserfs: correct mount option parsing to detect when quota options can be changed We should not allow user to change quota mount options when quota is just suspended. It would make mount options and internal quota state inconsistent. Also we should not allow user to change quota format when quota is turned on. On the other hand we can just silently ignore when some option is set to the value it already has (some mount versions do this on remount). Finally, we should not discard current quota options if parsing of mount options fails. Cc: <reiserfs-devel@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:33 -07:00
Jan Kara	4506567b24	reiserfs: fix typos in messages and comments (journalled -> journaled) Cc: <reiserfs-devel@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:33 -07:00
Jan Kara	5d4f7fddf8	reiserfs: fix synchronization of quota files in journal=data mode In journal=data mode, it is not enough to do write_inode_now() as done in vfs_quota_on() to write all data to their final location (which is needed for quota_read to work correctly). Calling journal_end_sync() before calling vfs_quota_on() does it's job because transactions are committed to the journal and data marked as dirty in memory so write_inode_now() writes them to their final locations. Cc: <reiserfs-devel@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:33 -07:00
Matthias Kaehlcke	895c23f8c3	hfsplus: convert the extents_lock in a mutex Apple Extended HFS file system: The semaphore extents lock is used as a mutex. Convert it to the mutex API. Signed-off-by: Matthias Kaehlcke <matthias@kaehlcke.net> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:33 -07:00
Matthias Kaehlcke	39f8d472f2	hfs: convert extents_lock in a mutex Apple Macintosh file system: The semaphore extens_lock is used as a mutex. Convert it to the mutex API Signed-off-by: Matthias Kaehlcke <matthias@kaehlcke.net> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:33 -07:00
Matthias Kaehlcke	3084b72de7	hfs: convert bitmap_lock in a mutex Apple Macintosh file system: The semaphore bitmap_lock is used as a mutex. Convert it to the mutex API Signed-off-by: Matthias Kaehlcke <matthias@kaehlcke.net> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:33 -07:00
Adrian Bunk	de0ca06a99	coda: remove CODA_FS_OLD_API While fixing CONFIG_ leakages to the userspace kernel headers I ran into CODA_FS_OLD_API. After five years, are there still people using the old API left? Especially considering that you have to choose at compile time which API to support in the kernel (and distributions tend to offer the new API for some time). Jan: "The old API can definitely go. Around the time the new interface went in there were some non-Coda userspace file system implementations that took a while longer to convert to the new API, but by now they all switched to the new interface or in some cases to a FUSE-based solution." Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Jan Harkes <jaharkes@cs.cmu.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:33 -07:00
Adam Greenblatt	c0a1633b62	isofs: fix minor filesystem corruption Some iso9660 images contain files with rockridge data that is either incorrect or incompletely parsed. Prior to commit `f2966632a1` ("[PATCH] rock: handle directory overflows") (included with kernel 2.6.13) the kernel ignored the rockridge data for these files, while still allowing the files to be accessed under their non-rockridge names. That commit inadvertently changed things so that files with invalid rockridge data could not be accessed at all. (I ran across the problem when comparing some old CDs with hard disk copies I had made long ago under kernel 2.4: a few of the files on the hard disk copies were no longer visible on the CDs.) This change reverts to the pre-2.6.13 behavior. Signed-off-by: Adam Greenblatt <adam.greenblatt@gmail.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:33 -07:00
Duane Griffin	275c0a8f12	ext3: validate directory entry data before use ext3_dx_find_entry uses ext3_next_entry without verifying that the entry is valid. If its rec_len == 0 this causes an infinite loop. Refactor the loop to check the validity of entries before checking whether they match and moving onto the next one. There are other uses of ext3_next_entry in this file which also look problematic. They should be reviewed and fixed if/when we have a test-case that triggers them. This patch fixes the first case (image hdb.25.softlockup.gz) reported in http://bugzilla.kernel.org/show_bug.cgi?id=10882. Signed-off-by: Duane Griffin <duaneg@dghda.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:33 -07:00
Hidehiro Kawai	cbe5f466f6	jbd: don't abort if flushing file data failed In ordered mode, the current jbd aborts the journal if a file data buffer has an error. But this behavior is unintended, and we found that it has been adopted accidentally. This patch undoes it and just calls printk() instead of aborting the journal. Additionally, set AS_EIO into the address_space object of the failed buffer which is submitted by journal_do_submit_data() so that fsync() can get -EIO. Missing error checkings are also added to inform errors on file data buffers to the user. The following buffers are targeted. (a) the buffer which has already been written out by pdflush (b) the buffer which has been unlocked before scanned in the t_locked_list loop [akpm@linux-foundation.org: improve grammar in a printk] Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Jan Kara <jack@suse.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:32 -07:00
Li Zefan	8ef2720397	ext3: kill 2 useless magic numbers dx_root_limit() will never return 20, and I can't figure out what 20 stands for. This function has never changed since htree directory indexing was merged. Similar for dx_node_limit() and the magic 22. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Andreas Dilger <adilger@sun.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:32 -07:00
Toshiyuki Okajima	fc80c44277	jbd: positively dispose the unmapped data buffers in journal_commit_transaction() After ext3-ordered files are truncated, there is a possibility that the pages which cannot be estimated still remain. Remaining pages can be released when the system has really few memory. So, it is not memory leakage. But the resource management software etc. may not work correctly. It is possible that journal_unmap_buffer() cannot release the buffers, and the pages to which they belong because they are attached to a commiting transaction and journal_unmap_buffer() cannot release them. To release such the buffers and the pages later, journal_unmap_buffer() leaves it to journal_commit_transaction(). (journal_unmap_buffer() puts the mark 'BH_Freed' to the buffers so that journal_commit_transaction() can identify whether they can be released or not.) In the journalled mode and the writeback mode, jbd does with only metadata buffers. But in the ordered mode, jbd does with metadata buffers and also data buffers. Actually, journal_commit_transaction() releases only the metadata buffers of which release is demanded by journal_unmap_buffer(), and also releases the pages to which they belong if possible. As a result, the data buffers of which release is demanded by journal_unmap_buffer() remain after a transaction commits. And also the pages to which they belong remain. Such the remained pages don't have mapping any longer. Due to this fact, there is a possibility that the pages which cannot be estimated remain. The metadata buffers marked 'BH_Freed' and the pages to which they belong can be released at 'JBD: commit phase 7'. Therefore, by applying the same code into 'JBD: commit phase 2' (where the data buffers are done with), journal_commit_transaction() can also release the data buffers marked 'BH_Freed' and the pages to which they belong. As a result, all the buffers marked 'BH_Freed' can be released, and also all the pages to which these buffers belong can be released at journal_commit_transaction(). So, the page which cannot be estimated is lost. <<Excerpt of code at 'JBD: commit phase 7'>> > spin_lock(&journal->j_list_lock); > while (commit_transaction->t_forget) { > transaction_t cp_transaction; > struct buffer_head bh; > > jh = commit_transaction->t_forget; >... > if (buffer_freed(bh)) { > ^^^^^^^^^^^^^^^^^^^^^^^^ > clear_buffer_freed(bh); > ^^^^^^^^^^^^^^^^^^^^^^^^ > clear_buffer_jbddirty(bh); > } > > if (buffer_jbddirty(bh)) { > JBUFFER_TRACE(jh, "add to new checkpointing trans"); > __journal_insert_checkpoint(jh, commit_transaction); > JBUFFER_TRACE(jh, "refile for checkpoint writeback"); > __journal_refile_buffer(jh); > jbd_unlock_bh_state(bh); > } else { > J_ASSERT_BH(bh, !buffer_dirty(bh)); > ... > JBUFFER_TRACE(jh, "refile or unfile freed buffer"); > __journal_refile_buffer(jh); > if (!jh->b_transaction) { > jbd_unlock_bh_state(bh); > /* needs a brelse / > journal_remove_journal_head(bh); > release_buffer_page(bh); > ^^^^^^^^^^^^^^^^^^^^^^^^ > } else > } *************************************************************** * Apply the code of "^^^^^^" lines into 'JBD: commit phase 2' * **************************************************************** At journal_commit_transaction() code, there is one extra message in the series of jbd debug messages. ("JBD: commit phase 2") This patch fixes it, too. Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Acked-by: Jan Kara <jack@suse.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:32 -07:00
Adrian Bunk	a10320e8f7	jbd: unexport journal_update_superblock Remove the unused EXPORT_SYMBOL(journal_update_superblock). Signed-off-by: Adrian Bunk <bunk@kernel.org> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:32 -07:00
Duane Griffin	3ccc3167b0	ext3: handle deleting corrupted indirect blocks While freeing indirect blocks we attach a journal head to the parent buffer head, free the blocks, then journal the parent. If the indirect block list is corrupted and points to the parent the journal head will be detached when the block is cleared, causing an OOPS. Check for that explicitly and handle it gracefully. This patch fixes the third case (image hdb.20000057.nullderef.gz) reported in http://bugzilla.kernel.org/show_bug.cgi?id=10882. Immediately above the change, in the ext3_free_data function, we call ext3_clear_blocks to clear the indirect blocks in this parent block. If one of those blocks happens to actually be the parent block it will clear b_private / BH_JBD. I did the check at the end rather than earlier as it seemed more elegant. I don't think there should be much practical difference, although it is possible the FS may not be quite so badly corrupted if we did it the other way (and didn't clear the block at all). To be honest, I'm not convinced there aren't other similar failure modes lurking in this code, although I couldn't find any with a quick review. [akpm@linux-foundation.org: fix printk warning] Signed-off-by: Duane Griffin <duaneg@dghda.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:32 -07:00
Hidehiro Kawai	95450f5a7e	ext3: don't read inode block if the buffer has a write error A transient I/O error can corrupt inode data. Here is the scenario: (1) update inode_A at the block_B (2) pdflush writes out new inode_A to the filesystem, but it results in write I/O error, at this point, BH_Uptodate flag of the buffer for block_B is cleared and BH_Write_EIO is set (3) create new inode_C which located at block_B, and __ext3_get_inode_loc() tries to read on-disk block_B because the buffer is not uptodate (4) if it can read on-disk block_B successfully, inode_A is overwritten by old data This patch makes __ext3_get_inode_loc() not read the inode block if the buffer has BH_Write_EIO flag. In this case, the buffer should have the latest information, so setting the uptodate flag to the buffer (this avoids WARN_ON_ONCE() in mark_buffer_dirty().) According to this change, we would need to test BH_Write_EIO flag for the error checking. Currently nobody checks write I/O errors on metadata buffers, but it will be done in other patches I'm working on. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: sugita <yumiko.sugita.yf@hitachi.com> Cc: Satoshi OSHIMA <satoshi.oshima.fk@hitachi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Jan Kara <jack@ucw.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:32 -07:00
Duane Griffin	ae76dd9a6b	ext3: handle corrupted orphan list at mount If the orphan node list includes valid, untruncatable nodes with nlink > 0 the ext3_orphan_cleanup loop which attempts to delete them will not do so, causing it to loop forever. Fix by checking for such nodes in the ext3_orphan_get function. This patch fixes the second case (image hdb.20000009.softlockup.gz) reported in http://bugzilla.kernel.org/show_bug.cgi?id=10882. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: printk warning fix] Signed-off-by: Duane Griffin <duaneg@dghda.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:32 -07:00
Shen Feng	ef1afd3951	ext3: remove double definitions of xattr macros remove the definitions of macros: XATTR_TRUSTED_PREFIX XATTR_USER_PREFIX since they are defined in linux/xattr.h Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:32 -07:00
Mingming Cao	3f31fddfa2	jbd: fix race between free buffer and commit transaction journal_try_to_free_buffers() could race with jbd commit transaction when the later is holding the buffer reference while waiting for the data buffer to flush to disk. If the caller of journal_try_to_free_buffers() request tries hard to release the buffers, it will treat the failure as error and return back to the caller. We have seen the directo IO failed due to this race. Some of the caller of releasepage() also expecting the buffer to be dropped when passed with GFP_KERNEL mask to the releasepage()->journal_try_to_free_buffers(). With this patch, if the caller is passing the __GFP_WAIT and __GFP_FS to indicating this call could wait, in case of try_to_free_buffers() failed, let's waiting for journal_commit_transaction() to finish commit the current committing transaction, then try to free those buffers again. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Mingming Cao <cmm@us.ibm.com> Reviewed-by: Badari Pulavarty <pbadari@us.ibm.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:32 -07:00
Shen Feng	9ebfbe9f92	ext3: improve some code in rb tree part of dir.c - remove unnecessary code in free_rb_tree_fname - rename free_rb_tree_fname to ext3_htree_create_dir_info since it and ext3_htree_free_dir_info are a pair - replace kmalloc with kzalloc in ext3_htree_free_dir_info Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:32 -07:00
Duane Griffin	1984bb763c	jbd: tidy up revoke cache initialisation and destruction Make revocation cache destruction safe to call if initialisation fails partially or entirely. This allows it to be used to cleanup in the case of initialisation failure, simplifying that code slightly. Signed-off-by: Duane Griffin <duaneg@dghda.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:32 -07:00
Duane Griffin	f4d79ca2fa	jbd: eliminate duplicated code in revocation table init/destroy functions The revocation table initialisation/destruction code is repeated for each of the two revocation tables stored in the journal. Refactoring the duplicated code into functions is tidier, simplifies the logic in initialisation in particular, and slightly reduces the code size. There should not be any functional change. Signed-off-by: Duane Griffin <duaneg@dghda.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:32 -07:00
Duane Griffin	3850f7a521	jbd: replace potentially false assertion with if block If an error occurs during jbd cache initialisation it is possible for the journal_head_cache to be NULL when journal_destroy_journal_head_cache is called. Replace the J_ASSERT with an if block to handle the situation correctly. Note that even with this fix things will break badly if jbd is statically compiled in and cache initialisation fails. Signed-off-by: Duane Griffin <duaneg@dghda.com Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:32 -07:00
Jan Kara	d06bf1d252	ext3: correct mount option parsing to detect when quota options can be changed We should not allow user to change quota mount options when quota is just suspended. I would make mount options and internal quota state inconsistent. Also we should not allow user to change quota format when quota is turned on. On the other hand we can just silently ignore when some option is set to the value it already has (mount does this on remount). Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:32 -07:00
Jan Kara	99aeaf639f	ext3: fix typos in messages and comments (journalled -> journaled) Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:31 -07:00
Jan Kara	9cfe7b9010	ext3: fix synchronization of quota files in journal=data mode In journal=data mode, it is not enough to do write_inode_now as done in vfs_quota_on() to write all data to their final location (which is needed for quota_read to work correctly). Calling journal_flush() does its job. Reported-by: Nick <gentuu@gmail.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:31 -07:00
Shen Feng	f905f06fca	ext2: remove double definitions of xattr macros remove the definitions of macros: XATTR_TRUSTED_PREFIX XATTR_USER_PREFIX since they are defined in linux/xattr.h Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:31 -07:00
Adrian Bunk	fb523f3227	minix: remove !NO_TRUNCATE code This patch removes the !NO_TRUNCATE code that anyway required a manual editing of the code for being used. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:30 -07:00
Hugh Dickins	ba92a43dba	exec: remove some includes fs/exec.c used to need mman.h pagemap.h swap.h and rmap.h when it did mm-ish stuff in install_arg_page(); but no need for them after 2.6.22. [akpm@linux-foundation.org: unbreak arm] Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:28 -07:00
Harvey Harrison	b7bbf8fa6b	fs: ldm.[ch] use get_unaligned_* helpers Replace the private BE16/BE32/BE64 macros with direct calls to get_unaligned_be16/32/64. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:26 -07:00
David Woodhouse	ff877ea80e	Merge branch 'linux-next' of git://git.infradead.org/~dedekind/ubi-2.6	2008-07-25 10:40:14 -04:00
Nathan Lynch	483fad1c3f	ELF loader support for auxvec base platform string Some IBM POWER-based platforms have the ability to run in a mode which mostly appears to the OS as a different processor from the actual hardware. For example, a Power6 system may appear to be a Power5+, which makes the AT_PLATFORM value "power5+". This means that programs are restricted to the ISA supported by Power5+; Power6-specific instructions are treated as illegal. However, some applications (virtual machines, optimized libraries) can benefit from knowledge of the underlying CPU model. A new aux vector entry, AT_BASE_PLATFORM, will denote the actual hardware. For example, on a Power6 system in Power5+ compatibility mode, AT_PLATFORM will be "power5+" and AT_BASE_PLATFORM will be "power6". The idea is that AT_PLATFORM indicates the instruction set supported, while AT_BASE_PLATFORM indicates the underlying microarchitecture. If the architecture has defined ELF_BASE_PLATFORM, copy that value to the user stack in the same manner as ELF_PLATFORM. Signed-off-by: Nathan Lynch <ntl@pobox.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2008-07-25 15:44:39 +10:00
Adrian Bunk	fb2e405fc1	fix fs/nfs/nfsroot.c compilation This fixes the following compile error caused by commit `f9247273cb` ("UFS: add const to parser token table"): CC fs/nfs/nfsroot.o /home/bunk/linux/kernel-2.6/git/linux-2.6/fs/nfs/nfsroot.c:130: error: tokens causes a section type conflict make[3]: *** [fs/nfs/nfsroot.o] Error 1 Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 17:32:41 -07:00
Chris Wright	e2d2867ff8	When verifying the decoded header before decoding the object identifier (expecting a SPNEGO pseudo-mechanism oid), the test to verify it is a primitive encoding is compared against the asn1 class. Primitive is not a class. This brings check in line with similar check for krb/ntlmssp oid. Signed-off-by: Chris Wright <chrisw@sous-sol.org> Cc: Steven French <sfrench@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-07-24 20:43:34 +00:00
Steven Whitehouse	f9247273cb	UFS: add const to parser token table This patch adds a "const" to the parser token table. I've done an allmodconfig build to see if this produces any warnings/failures and the patch includes a fix for the only warning that was produced. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Acked-by: Alexander Viro <aviro@redhat.com> Acked-by: Evgeniy Dushistov <dushistov@mail.ru> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 11:50:15 -07:00
Ian Kent	aa55ddf340	autofs4: remove unused ioctls The ioctls AUTOFS_IOC_TOGGLEREGHOST and AUTOFS_IOC_ASKREGHOST were added several years ago but what they were intended for has never been implemented (as far as I'm aware noone uses them) so remove them. Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:33 -07:00
Ian Kent	06a3598552	autofs4: reorganize expire pending wait function calls This patch re-orgnirzes the checking for and waiting on active expires and elininates redundant checks. Signed-off-by: Ian Kent <raven@themaw.net> Cc: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:32 -07:00
Ian Kent	ec6e8c7d3f	autofs4: fix direct mount pending expire race - correction Appologies, somehow I seem to have sent an out dated version of this patch. Here is an additional patch that brings the patch up to date. Signed-off-by: Ian Kent <raven@themaw.net> Cc: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:32 -07:00
Ian Kent	6e60a9ab5f	autofs4: fix direct mount pending expire race For direct and offset type mounts that are covered by another mount we cannot check the AUTOFS_INF_EXPIRING flag during a path walk which leads to lookups walking into an expiring mount while it is being expired. For example, for the direct multi-mount map entry with a couple of offsets: /race/mm1 / <server1>:/<path1> /om1 <server2>:/<path2> /om2 <server1>:/<path3> an autofs trigger mount is mounted on /race/mm1 and when accessed it is over mounted and trigger mounts made for /race/mm1/om1 and /race/mm1/om2. So it isn't possible for path walks to see the expiring flag at all and they happily walk into the file system while it is expiring. When expiring these mounts follow_down() must stop at the autofs mount and all processes must block in the ->follow_link() method (except the daemon) until the expire is complete. This is done by decrementing the d_mounted field of the autofs trigger mount root dentry until the expire is completed. In ->follow_link() all processes wait on the expire and the mount following is completed for the daemon until the expire is complete. Signed-off-by: Ian Kent <raven@themaw.net> Cc: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:32 -07:00
Ian Kent	97e7449a7a	autofs4: fix indirect mount pending expire race The selection of a dentry for expiration and the setting of the AUTOFS_INF_EXPIRING flag isn't done atomically which can lead to lookups walking into an expiring mount. What happens is that an expire is initiated by the daemon and a dentry is selected for expire but, since there is no lock held between the selection and setting of the expiring flag, a process may find the flag clear and continue walking into the mount tree at the same time the daemon attempts the expire it. Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:32 -07:00
Ian Kent	26e81b3142	autofs4: fix pending checks There are two cases for which a dentry that has a pending mount request does not wait for completion. One is via autofs4_revalidate() and the other via autofs4_follow_link(). In revalidate, after the mount point directory is created, but before the mount is done, the check in try_to_fill_dentry() can can fail to send the dentry to the wait queue since the dentry is positive and the lookup flags may contain only LOOKUP_FOLLOW. Although we don't trigger a mount for the LOOKUP_FOLLOW flag, if ther's one pending we might as well wait and use the mounted dentry for the lookup. In autofs4_follow_link() the dentry is not checked to see if it is pending so it may fail to call try_to_fill_dentry() and not wait for mount completion. A dentry that is pending must always be sent to the wait queue. Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:32 -07:00
Ian Kent	ff9cd499d6	autofs4: cleanup redundant readir code The mount triggering functionality of readdir and related functions is no longer used (and is quite broken as well). The unused portions have been removed. Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:32 -07:00
Ian Kent	c72305b547	autofs4: indirect dentry must almost always be positive We have been seeing mount requests comming to the automount daemon for keys of the form "<map key>/<non key directory>" which are lookups for invalid map keys. But we can check for this in the kernel module and return a fail immediately, without having to send a request to the daemon. It is possible to recognise these requests are invalid based on whether the request dentry is negative and its relation to the autofs file system root. For example, given the indirect multi-mount map entry: idm1 \ /mm1 <server>:/<path1> /mm2 <server>:/<path2> For a request to mount idm1, IS_ROOT((idm1)->d_parent) will be always be true and the dentry may be negative. But directories idm1/mm1 and idm1/mm2 will always be created as part of the mount request for idm1. So any mount request within idm1 itself must have a positive dentry otherwise the map key is invalid. In version 4 these multi-mount entries are all mounted and umounted as a single request and in version 5 the directories idm1/mm1 and idm1/mm2 are created and an autofs fs mounted on them to act as a mount trigger so the above is also true. This also holds true for the autofs version 4 pseudo direct mount feature. When this feature is used without the "--ghost" option automount(8) will create internal submounts as we go down the map key paths which are essentially normal indirect mounts for which the above holds. If the "--ghost" option is given the directories for map keys are created at daemon startup so valid map entries correspond to postive dentries in the autofs fs. autofs version 5 direct mount maps are similar except that the IS_ROOT check is not needed. This has been addressed in a previous patch tittled "autofs4 - detect invalid direct mount requests". For example, given the direct multi-mount map entry: /test/dm1 \ /mm1 <server>:/<path1> /mm2 <server>:/<path2> An autofs fs is mounted on /test/dm1 as a trigger mount and when a mount is triggered for /test/dm1, the multi-mount offset directories /test/dm1/mm1 and /test/dm1/mm2 are created and an autofs fs is mounted on them to act as mount triggers. So valid direct mount requests must always have a positive dentry if they correspond to a valid map entry. Signed-off-by: Ian Kent <raven@themaw.net> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:32 -07:00
Ian Kent	eb3b176796	autofs4: detect invalid direct mount requests autofs v5 direct and offset mounts within an autofs filesystem are triggered by existing autofs triger mounts so the mount point dentry must be positive. If the mount point dentry is negative then the trigger doesn't exist so we can return fail immediately. Signed-off-by: Ian Kent <raven@themaw.net> Cc: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:32 -07:00
Ian Kent	296f7bf78b	autofs4: fix waitq memory leak If an autofs mount becomes catatonic before autofs4_wait_release() is called the wait queue counter will not be decremented down to zero and the entry will never be freed. There are also races decrementing the wait counter in the wait release function. To deal with this the counter needs to be updated while holding the wait queue mutex and waiters need to be woken up unconditionally when the wait is removed from the queue to ensure we eventually free the wait. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:32 -07:00
Ian Kent	e64be33cca	autofs4: check kernel communication pipe is valid for write It is possible for an autofs mount to become catatonic (and for the daemon communication pipe to become NULL) after a wait has been initiallized but before the request has been sent to the daemon. We need to check for this before sending the request packet. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:32 -07:00
Ian Kent	f4c7da0261	autofs4: add missing kfree It see that the patch tittled "autofs4 - fix pending mount race" is missing a change that I had recently made. It's missing a kfree for the case mutex_lock_interruptible() fails to aquire the wait queue mutex. Signed-off-by: Ian Kent <raven@themaw.net> Cc: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:32 -07:00
Ian Kent	a1362fe92f	autofs4: fix pending mount race Close a race between a pending mount that is about to finish and a new lookup for the same directory. Process P1 triggers a mount of directory foo. It sets DCACHE_AUTOFS_PENDING in the ->lookup routine, creates a waitq entry for 'foo', and calls out to the daemon to perform the mount. The autofs daemon will then create the directory 'foo', using a new dentry that will be hashed in the dcache. Before the mount completes, another process, P2, tries to walk into the 'foo' directory. The vfs path walking code finds an entry for 'foo' and calls the revalidate method. Revalidate finds that the entry is not PENDING (because PENDING was never set on the dentry created by the mkdir), but it does find the directory is empty. Revalidate calls try_to_fill_dentry, which sets the PENDING flag and then calls into the autofs4 wait code to trigger or wait for a mount of 'foo'. The wait code finds the entry for 'foo' and goes to sleep waiting for the completion of the mount. Yet another process, P3, tries to walk into the 'foo' directory. This process again finds a dentry in the dcache for 'foo', and calls into the autofs revalidate code. The revalidate code finds that the PENDING flag is set, and so calls try_to_fill_dentry. a) try_to_fill_dentry sets the PENDING flag redundantly for this dentry, then calls into the autofs4 wait code. b) the autofs4 wait code takes the waitq mutex and searches for an entry for 'foo' Between a and b, P1 is woken up because the mount completed. P1 takes the wait queue mutex, clears the PENDING flag from the dentry, and removes the waitqueue entry for 'foo' from the list. When it releases the waitq mutex, P3 (eventually) acquires it. At this time, it looks for an existing waitq for 'foo', finds none, and so creates a new one and calls out to the daemon to mount the 'foo' directory. Now, the reason that three processes are required to trigger this race is that, because the PENDING flag is not set on the dentry created by mkdir, the window for the race would be way to slim for it to ever occur. Basically, between the testing of d_mountpoint(dentry) and the taking of the waitq mutex, the mount would have to complete and the daemon would have to be woken up, and that in turn would have to wake up P1. This is simply impossible. Add the third process, though, and it becomes slightly more likely. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:32 -07:00
Ian Kent	5a11d4d0ee	autofs4: fix waitq locking The autofs4_catatonic_mode() function accesses the wait queue without any locking but can be called at any time. This could lead to a possible double free of the name field of the wait and a double fput of the daemon communication pipe or an fput of a NULL file pointer. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:32 -07:00
Jeff Moyer	70b52a0a50	autofs4: use struct qstr in waitq.c The autofs_wait_queue already contains all of the fields of the struct qstr, so change it into a qstr. This patch, from Jeff Moyer, has been modified a liitle by myself. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:32 -07:00
Ian Kent	6d5cb926fa	autofs4: use lookup intent flags to trigger mounts When an open(2) call is made on an autofs mount point directory that already exists and the O_DIRECTORY flag is not used the needed mount callback to the daemon is not done. This leads to the path walk continuing resulting in a callback to the daemon with an incorrect key. open(2) is called without O_DIRECTORY by the "find" utility but this should be handled properly anyway. This happens because autofs needs to use the lookup flags to decide when to callback to the daemon to perform a mount to prevent mount storms. For example, an autofs indirect mount map that has the "browse" option will have the mount point directories are pre-created and the stat(2) call made by a color ls against each directory will cause all these directories to be mounted. It is unfortunate we need to resort to this but mount maps can be quite large. Additionally, if a user manually umounts an autofs indirect mount the directory isn't removed which also leads to this situation. To resolve this autofs needs to use the lookup intent flags to enable it to make this decision. This patch adds this check and triggers a call back if any of the lookup intent flags are set as all these calls warrant a mount attempt be requested. I know that external VFS code which uses the lookup flags is something that the VFS would like to eliminate but I have no choice as I can't see any other way to do this. A VFS dentry or inode operation callback which returns the lookup "type" (requires a definition) would be sufficient. But this change is needed now and I'm not aware of the form that coming VFS changes will take so I'm not willing to propose anything along these lines. If anyone can provide an alternate method I would be happy to use it. [akpm@linux-foundation.org: fix build for concurrent VFS changes] Signed-off-by: Ian Kent <raven@themaw.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:31 -07:00
Ian Kent	c432c2586a	autofs4: don't release directory mutex if called in oz_mode Since we now delay hashing of dentrys until the ->mkdir() call, droping and re-taking the directory mutex within the ->lookup() function when we are being called by user space is not needed. This can lead to a race when other processes are attempting to access the same directory during mount point directory creation. In this case we need to hang onto the mutex to ensure we don't get user processes trying to create a mount request for a newly created dentry after the mount point entry has already been created. This ensures that when we need to check a dentry passed to autofs4_wait(), if it is hashed, it is always the mount point dentry and not a new dentry created by another lookup during directory creation. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:31 -07:00
Ian Kent	ef581a7428	autofs4: fix symlink name allocation The length of the symlink name has been moved but it needs to be set before allocating space for it in the dentry info struct. This corrects a mistake in a recent patch. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:31 -07:00
Ian Kent	2576737873	autofs4: use look aside list for lookups A while ago a patch to resolve a deadlock during directory creation was merged. This delayed the hashing of lookup dentrys until the ->mkdir() (or ->symlink()) operation completed to ensure we always went through ->lookup() instead of also having processes go through ->revalidate() so our VFS locking remained consistent. Now we are seeing a couple of side affects of that change in situations with heavy mount activity. Two cases have been identified: 1) When a mount request is triggered, due to the delayed hashing, the directory created by user space for the mount point doesn't have the DCACHE_AUTOFS_PENDING flag set. In the case of an autofs multi-mount where a tree of mount point directories are created this can lead to the path walk continuing rather than the dentry being sent to the wait queue to wait for request completion. This is because, if the pending flag isn't set, the criteria for deciding this is a mount in progress fails to hold, namely that the dentry is not a mount point and has no subdirectories. 2) A mount request dentry is initially created negative and unhashed. It remains this way until the ->mkdir() callback completes. Since it is unhashed a fresh dentry is used when the user space mount request creates the mount point directory. This leaves the original dentry negative and unhashed. But revalidate has no way to tell the VFS that the dentry has changed, other than to force another ->lookup() by returning false, which is at best wastefull and at worst not possible. This results in an -ENOENT return from the original path walk when in fact the mount succeeded. To resolve this we need to ensure that the same dentry is used in all calls to ->lookup() during the course of a mount request. This patch achieves that by adding the initial dentry to a look aside list and removes it at ->mkdir() or ->symlink() completion (or when the dentry is released), since these are the only create operations autofs4 supports. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:31 -07:00
Ian Kent	caf7da3d5d	autofs4: revert - redo lookup in ttfd This patch series enables the use of a single dentry for lookups prior to the dentry being hashed and so we no longer need to redo the lookup. This patch reverts the patch of commit `033790449b`. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:31 -07:00
Ian Kent	5f6f4f28b6	autofs4: don't make expiring dentry negative Correct the error of making a positive dentry negative after it has been instantiated. The code that makes this error attempts to re-use the dentry from a concurrent expire and mount to resolve a race and the dentry used for the lookup must be negative for mounts to trigger in the required cases. The fact is that the dentry doesn't need to be re-used because all that is needed is to preserve the flag that indicates an expire is still incomplete at the time of the mount request. This change uses the the dentry to check the flag and wait for the expire to complete then discards it instead of attempting to re-use it. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:31 -07:00
Michael Halcrow	391b52f98c	eCryptfs: Make all persistent file opens delayed There is no good reason to immediately open the lower file, and that can cause problems with files that the user does not intend to immediately open, such as device nodes. This patch removes the persistent file open from the interpose step and pushes that to the locations where eCryptfs really does need the lower persistent file, such as just before reading or writing the metadata stored in the lower file header. Two functions are jumping to out_dput when they should just be jumping to out on error paths. This patch also fixes these. Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:31 -07:00
Michael Halcrow	72b55fffd6	eCryptfs: do not try to open device files on mknod When creating device nodes, eCryptfs needs to delay actually opening the lower persistent file until an application tries to open. Device handles may not be backed by anything when they first come into existence. [Valdis.Kletnieks@vt.edu: build fix] Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Cc: <Valdis.Kletnieks@vt.edu} Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:31 -07:00
Harvey Harrison	0a688ad713	ecryptfs: inode.c mmap.c use unaligned byteorder helpers Fixe sparse warnings: fs/ecryptfs/inode.c:368:15: warning: cast to restricted __be64 fs/ecryptfs/mmap.c:385:12: warning: incorrect type in assignment (different base types) fs/ecryptfs/mmap.c:385:12: expected unsigned long long [unsigned] [assigned] [usertype] file_size fs/ecryptfs/mmap.c:385:12: got restricted __be64 [usertype] <noident> fs/ecryptfs/mmap.c:428:12: warning: incorrect type in assignment (different base types) fs/ecryptfs/mmap.c:428:12: expected unsigned long long [unsigned] [assigned] [usertype] file_size fs/ecryptfs/mmap.c:428:12: got restricted __be64 [usertype] <noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:31 -07:00
Harvey Harrison	29335c6a41	ecryptfs: crypto.c use unaligned byteorder helpers Fixes the following sparse warnings: fs/ecryptfs/crypto.c:1036:8: warning: cast to restricted __be32 fs/ecryptfs/crypto.c:1038:8: warning: cast to restricted __be32 fs/ecryptfs/crypto.c:1077:10: warning: cast to restricted __be32 fs/ecryptfs/crypto.c:1103:6: warning: incorrect type in assignment (different base types) fs/ecryptfs/crypto.c:1105:6: warning: incorrect type in assignment (different base types) fs/ecryptfs/crypto.c:1124:8: warning: incorrect type in assignment (different base types) fs/ecryptfs/crypto.c:1241:21: warning: incorrect type in assignment (different base types) fs/ecryptfs/crypto.c:1244:30: warning: incorrect type in assignment (different base types) fs/ecryptfs/crypto.c:1414:23: warning: cast to restricted __be32 fs/ecryptfs/crypto.c:1417:32: warning: cast to restricted __be16 Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:31 -07:00
Miklos Szeredi	8f2368095e	ecryptfs: string copy cleanup Clean up overcomplicated string copy, which also gets rid of this bogus warning: fs/ecryptfs/main.c: In function 'ecryptfs_parse_options': include/asm/arch/string_32.h:75: warning: array subscript is above array bounds Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:31 -07:00
Eric Sandeen	982363c97f	ecryptfs: propagate key errors up at mount time Mounting with invalid key signatures should probably fail, if they were specifically requested but not available. Also fix case checks in process_request_key_err() for the right sign of the errnos, as spotted by Jan Tluka. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Jan Tluka <jtluka@redhat.com> Acked-by: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:31 -07:00
Tyler Hicks	6c4c17b073	ecryptfs: discard ecryptfsd registration messages in miscdev The userspace eCryptfs daemon sends HELO and QUIT messages to the kernel for per-user daemon (un)registration. These messages are required when netlink is used as the transport, but (un)registration is handled by opening and closing the device file when miscdev is the transport. These messages should be discarded in the miscdev transport so that a daemon isn't registered twice. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:31 -07:00
Michael Halcrow	746f1e558b	eCryptfs: Privileged kthread for lower file opens eCryptfs would really like to have read-write access to all files in the lower filesystem. Right now, the persistent lower file may be opened read-only if the attempt to open it read-write fails. One way to keep from having to do that is to have a privileged kthread that can open the lower persistent file on behalf of the user opening the eCryptfs file; this patch implements this functionality. This patch will properly allow a less-privileged user to open the eCryptfs file, followed by a more-privileged user opening the eCryptfs file, with the first user only being able to read and the second user being able to both read and write. eCryptfs currently does this wrong; it will wind up calling vfs_write() on a file that was opened read-only. This is fixed in this patch. Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:30 -07:00
Ulrich Drepper	9fe5ad9c8c	flag parameters add-on: remove epoll_create size param Remove the size parameter from the new epoll_create syscall and renames the syscall itself. The updated test program follows. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <time.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_epoll_create2 # ifdef __x86_64__ # define __NR_epoll_create2 291 # elif defined __i386__ # define __NR_epoll_create2 329 # else # error "need __NR_epoll_create2" # endif #endif #define EPOLL_CLOEXEC O_CLOEXEC int main (void) { int fd = syscall (__NR_epoll_create2, 0); if (fd == -1) { puts ("epoll_create2(0) failed"); return 1; } int coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { puts ("epoll_create2(0) set close-on-exec flag"); return 1; } close (fd); fd = syscall (__NR_epoll_create2, EPOLL_CLOEXEC); if (fd == -1) { puts ("epoll_create2(EPOLL_CLOEXEC) failed"); return 1; } coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { puts ("epoll_create2(EPOLL_CLOEXEC) set close-on-exec flag"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:29 -07:00
Ulrich Drepper	e38b36f325	flag parameters: check magic constants This patch adds test that ensure the boundary conditions for the various constants introduced in the previous patches is met. No code is generated. [akpm@linux-foundation.org: fix alpha] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:29 -07:00
Ulrich Drepper	510df2dd48	flag parameters: NONBLOCK in inotify_init This patch adds non-blocking support for inotify_init1. The additional changes needed are minimal. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_inotify_init1 # ifdef __x86_64__ # define __NR_inotify_init1 294 # elif defined __i386__ # define __NR_inotify_init1 332 # else # error "need __NR_inotify_init1" # endif #endif #define IN_NONBLOCK O_NONBLOCK int main (void) { int fd = syscall (__NR_inotify_init1, 0); if (fd == -1) { puts ("inotify_init1(0) failed"); return 1; } int fl = fcntl (fd, F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if (fl & O_NONBLOCK) { puts ("inotify_init1(0) set non-blocking mode"); return 1; } close (fd); fd = syscall (__NR_inotify_init1, IN_NONBLOCK); if (fd == -1) { puts ("inotify_init1(IN_NONBLOCK) failed"); return 1; } fl = fcntl (fd, F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if ((fl & O_NONBLOCK) == 0) { puts ("inotify_init1(IN_NONBLOCK) set non-blocking mode"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:29 -07:00
Ulrich Drepper	be61a86d72	flag parameters: NONBLOCK in pipe This patch adds O_NONBLOCK support to pipe2. It is minimally more involved than the patches for eventfd et.al but still trivial. The interfaces of the create_write_pipe and create_read_pipe helper functions were changed and the one other caller as well. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_pipe2 # ifdef __x86_64__ # define __NR_pipe2 293 # elif defined __i386__ # define __NR_pipe2 331 # else # error "need __NR_pipe2" # endif #endif int main (void) { int fds[2]; if (syscall (__NR_pipe2, fds, 0) == -1) { puts ("pipe2(0) failed"); return 1; } for (int i = 0; i < 2; ++i) { int fl = fcntl (fds[i], F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if (fl & O_NONBLOCK) { printf ("pipe2(0) set non-blocking mode for fds[%d]\n", i); return 1; } close (fds[i]); } if (syscall (__NR_pipe2, fds, O_NONBLOCK) == -1) { puts ("pipe2(O_NONBLOCK) failed"); return 1; } for (int i = 0; i < 2; ++i) { int fl = fcntl (fds[i], F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if ((fl & O_NONBLOCK) == 0) { printf ("pipe2(O_NONBLOCK) does not set non-blocking mode for fds[%d]\n", i); return 1; } close (fds[i]); } puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:29 -07:00
Ulrich Drepper	6b1ef0e60d	flag parameters: NONBLOCK in timerfd_create This patch adds support for the TFD_NONBLOCK flag to timerfd_create. The additional changes needed are minimal. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <time.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_timerfd_create # ifdef __x86_64__ # define __NR_timerfd_create 283 # elif defined __i386__ # define __NR_timerfd_create 322 # else # error "need __NR_timerfd_create" # endif #endif #define TFD_NONBLOCK O_NONBLOCK int main (void) { int fd = syscall (__NR_timerfd_create, CLOCK_REALTIME, 0); if (fd == -1) { puts ("timerfd_create(0) failed"); return 1; } int fl = fcntl (fd, F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if (fl & O_NONBLOCK) { puts ("timerfd_create(0) set non-blocking mode"); return 1; } close (fd); fd = syscall (__NR_timerfd_create, CLOCK_REALTIME, TFD_NONBLOCK); if (fd == -1) { puts ("timerfd_create(TFD_NONBLOCK) failed"); return 1; } fl = fcntl (fd, F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if ((fl & O_NONBLOCK) == 0) { puts ("timerfd_create(TFD_NONBLOCK) set non-blocking mode"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:29 -07:00
Ulrich Drepper	e7d476dfdf	flag parameters: NONBLOCK in eventfd This patch adds support for the EFD_NONBLOCK flag to eventfd2. The additional changes needed are minimal. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_eventfd2 # ifdef __x86_64__ # define __NR_eventfd2 290 # elif defined __i386__ # define __NR_eventfd2 328 # else # error "need __NR_eventfd2" # endif #endif #define EFD_NONBLOCK O_NONBLOCK int main (void) { int fd = syscall (__NR_eventfd2, 1, 0); if (fd == -1) { puts ("eventfd2(0) failed"); return 1; } int fl = fcntl (fd, F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if (fl & O_NONBLOCK) { puts ("eventfd2(0) sets non-blocking mode"); return 1; } close (fd); fd = syscall (__NR_eventfd2, 1, EFD_NONBLOCK); if (fd == -1) { puts ("eventfd2(EFD_NONBLOCK) failed"); return 1; } fl = fcntl (fd, F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if ((fl & O_NONBLOCK) == 0) { puts ("eventfd2(EFD_NONBLOCK) does not set non-blocking mode"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:29 -07:00
Ulrich Drepper	5fb5e04926	flag parameters: NONBLOCK in signalfd This patch adds support for the SFD_NONBLOCK flag to signalfd4. The additional changes needed are minimal. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <signal.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_signalfd4 # ifdef __x86_64__ # define __NR_signalfd4 289 # elif defined __i386__ # define __NR_signalfd4 327 # else # error "need __NR_signalfd4" # endif #endif #define SFD_NONBLOCK O_NONBLOCK int main (void) { sigset_t ss; sigemptyset (&ss); sigaddset (&ss, SIGUSR1); int fd = syscall (__NR_signalfd4, -1, &ss, 8, 0); if (fd == -1) { puts ("signalfd4(0) failed"); return 1; } int fl = fcntl (fd, F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if (fl & O_NONBLOCK) { puts ("signalfd4(0) set non-blocking mode"); return 1; } close (fd); fd = syscall (__NR_signalfd4, -1, &ss, 8, SFD_NONBLOCK); if (fd == -1) { puts ("signalfd4(SFD_NONBLOCK) failed"); return 1; } fl = fcntl (fd, F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if ((fl & O_NONBLOCK) == 0) { puts ("signalfd4(SFD_NONBLOCK) does not set non-blocking mode"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:29 -07:00
Ulrich Drepper	99829b8329	flag parameters: NONBLOCK in anon_inode_getfd Building on the previous change to anon_inode_getfd, this patch introduces support for handling of O_NONBLOCK in addition to the already supported O_CLOEXEC. Following patches will take advantage of this support. As can be seen, the additional support for supporting this functionality is minimal. Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:28 -07:00
Ulrich Drepper	4006553b06	flag parameters: inotify_init This patch introduces the new syscall inotify_init1 (note: the 1 stands for the one parameter the syscall takes, as opposed to no parameter before). The values accepted for this parameter are function-specific and defined in the inotify.h header. Here the values must match the O_* flags, though. In this patch CLOEXEC support is introduced. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_inotify_init1 # ifdef __x86_64__ # define __NR_inotify_init1 294 # elif defined __i386__ # define __NR_inotify_init1 332 # else # error "need __NR_inotify_init1" # endif #endif #define IN_CLOEXEC O_CLOEXEC int main (void) { int fd; fd = syscall (__NR_inotify_init1, 0); if (fd == -1) { puts ("inotify_init1(0) failed"); return 1; } int coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { puts ("inotify_init1(0) set close-on-exit"); return 1; } close (fd); fd = syscall (__NR_inotify_init1, IN_CLOEXEC); if (fd == -1) { puts ("inotify_init1(IN_CLOEXEC) failed"); return 1; } coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { puts ("inotify_init1(O_CLOEXEC) does not set close-on-exit"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [akpm@linux-foundation.org: add sys_ni stub] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:28 -07:00
Ulrich Drepper	ed8cae8ba0	flag parameters: pipe This patch introduces the new syscall pipe2 which is like pipe but it also takes an additional parameter which takes a flag value. This patch implements the handling of O_CLOEXEC for the flag. I did not add support for the new syscall for the architectures which have a special sys_pipe implementation. I think the maintainers of those archs have the chance to go with the unified implementation but that's up to them. The implementation introduces do_pipe_flags. I did that instead of changing all callers of do_pipe because some of the callers are written in assembler. I would probably screw up changing the assembly code. To avoid breaking code do_pipe is now a small wrapper around do_pipe_flags. Once all callers are changed over to do_pipe_flags the old do_pipe function can be removed. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_pipe2 # ifdef __x86_64__ # define __NR_pipe2 293 # elif defined __i386__ # define __NR_pipe2 331 # else # error "need __NR_pipe2" # endif #endif int main (void) { int fd[2]; if (syscall (__NR_pipe2, fd, 0) != 0) { puts ("pipe2(0) failed"); return 1; } for (int i = 0; i < 2; ++i) { int coe = fcntl (fd[i], F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { printf ("pipe2(0) set close-on-exit for fd[%d]\n", i); return 1; } } close (fd[0]); close (fd[1]); if (syscall (__NR_pipe2, fd, O_CLOEXEC) != 0) { puts ("pipe2(O_CLOEXEC) failed"); return 1; } for (int i = 0; i < 2; ++i) { int coe = fcntl (fd[i], F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { printf ("pipe2(O_CLOEXEC) does not set close-on-exit for fd[%d]\n", i); return 1; } } close (fd[0]); close (fd[1]); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:28 -07:00
Ulrich Drepper	336dd1f70f	flag parameters: dup2 This patch adds the new dup3 syscall. It extends the old dup2 syscall by one parameter which is meant to hold a flag value. Support for the O_CLOEXEC flag is added in this patch. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <time.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_dup3 # ifdef __x86_64__ # define __NR_dup3 292 # elif defined __i386__ # define __NR_dup3 330 # else # error "need __NR_dup3" # endif #endif int main (void) { int fd = syscall (__NR_dup3, 1, 4, 0); if (fd == -1) { puts ("dup3(0) failed"); return 1; } int coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { puts ("dup3(0) set close-on-exec flag"); return 1; } close (fd); fd = syscall (__NR_dup3, 1, 4, O_CLOEXEC); if (fd == -1) { puts ("dup3(O_CLOEXEC) failed"); return 1; } coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { puts ("dup3(O_CLOEXEC) set close-on-exec flag"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:28 -07:00
Ulrich Drepper	a0998b50c3	flag parameters: epoll_create This patch adds the new epoll_create2 syscall. It extends the old epoll_create syscall by one parameter which is meant to hold a flag value. In this patch the only flag support is EPOLL_CLOEXEC which causes the close-on-exec flag for the returned file descriptor to be set. A new name EPOLL_CLOEXEC is introduced which in this implementation must have the same value as O_CLOEXEC. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <time.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_epoll_create2 # ifdef __x86_64__ # define __NR_epoll_create2 291 # elif defined __i386__ # define __NR_epoll_create2 329 # else # error "need __NR_epoll_create2" # endif #endif #define EPOLL_CLOEXEC O_CLOEXEC int main (void) { int fd = syscall (__NR_epoll_create2, 1, 0); if (fd == -1) { puts ("epoll_create2(0) failed"); return 1; } int coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { puts ("epoll_create2(0) set close-on-exec flag"); return 1; } close (fd); fd = syscall (__NR_epoll_create2, 1, EPOLL_CLOEXEC); if (fd == -1) { puts ("epoll_create2(EPOLL_CLOEXEC) failed"); return 1; } coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { puts ("epoll_create2(EPOLL_CLOEXEC) set close-on-exec flag"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:28 -07:00
Ulrich Drepper	11fcb6c146	flag parameters: timerfd_create The timerfd_create syscall already has a flags parameter. It just is unused so far. This patch changes this by introducing the TFD_CLOEXEC flag to set the close-on-exec flag for the returned file descriptor. A new name TFD_CLOEXEC is introduced which in this implementation must have the same value as O_CLOEXEC. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <time.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_timerfd_create # ifdef __x86_64__ # define __NR_timerfd_create 283 # elif defined __i386__ # define __NR_timerfd_create 322 # else # error "need __NR_timerfd_create" # endif #endif #define TFD_CLOEXEC O_CLOEXEC int main (void) { int fd = syscall (__NR_timerfd_create, CLOCK_REALTIME, 0); if (fd == -1) { puts ("timerfd_create(0) failed"); return 1; } int coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { puts ("timerfd_create(0) set close-on-exec flag"); return 1; } close (fd); fd = syscall (__NR_timerfd_create, CLOCK_REALTIME, TFD_CLOEXEC); if (fd == -1) { puts ("timerfd_create(TFD_CLOEXEC) failed"); return 1; } coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { puts ("timerfd_create(TFD_CLOEXEC) set close-on-exec flag"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:27 -07:00
Ulrich Drepper	b087498eb5	flag parameters: eventfd This patch adds the new eventfd2 syscall. It extends the old eventfd syscall by one parameter which is meant to hold a flag value. In this patch the only flag support is EFD_CLOEXEC which causes the close-on-exec flag for the returned file descriptor to be set. A new name EFD_CLOEXEC is introduced which in this implementation must have the same value as O_CLOEXEC. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_eventfd2 # ifdef __x86_64__ # define __NR_eventfd2 290 # elif defined __i386__ # define __NR_eventfd2 328 # else # error "need __NR_eventfd2" # endif #endif #define EFD_CLOEXEC O_CLOEXEC int main (void) { int fd = syscall (__NR_eventfd2, 1, 0); if (fd == -1) { puts ("eventfd2(0) failed"); return 1; } int coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { puts ("eventfd2(0) sets close-on-exec flag"); return 1; } close (fd); fd = syscall (__NR_eventfd2, 1, EFD_CLOEXEC); if (fd == -1) { puts ("eventfd2(EFD_CLOEXEC) failed"); return 1; } coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { puts ("eventfd2(EFD_CLOEXEC) does not set close-on-exec flag"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [akpm@linux-foundation.org: add sys_ni stub] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:27 -07:00
Ulrich Drepper	9deb27baed	flag parameters: signalfd This patch adds the new signalfd4 syscall. It extends the old signalfd syscall by one parameter which is meant to hold a flag value. In this patch the only flag support is SFD_CLOEXEC which causes the close-on-exec flag for the returned file descriptor to be set. A new name SFD_CLOEXEC is introduced which in this implementation must have the same value as O_CLOEXEC. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <signal.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_signalfd4 # ifdef __x86_64__ # define __NR_signalfd4 289 # elif defined __i386__ # define __NR_signalfd4 327 # else # error "need __NR_signalfd4" # endif #endif #define SFD_CLOEXEC O_CLOEXEC int main (void) { sigset_t ss; sigemptyset (&ss); sigaddset (&ss, SIGUSR1); int fd = syscall (__NR_signalfd4, -1, &ss, 8, 0); if (fd == -1) { puts ("signalfd4(0) failed"); return 1; } int coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { puts ("signalfd4(0) set close-on-exec flag"); return 1; } close (fd); fd = syscall (__NR_signalfd4, -1, &ss, 8, SFD_CLOEXEC); if (fd == -1) { puts ("signalfd4(SFD_CLOEXEC) failed"); return 1; } coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { puts ("signalfd4(SFD_CLOEXEC) does not set close-on-exec flag"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [akpm@linux-foundation.org: add sys_ni stub] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:27 -07:00
Ulrich Drepper	7d9dbca342	flag parameters: anon_inode_getfd extension This patch just extends the anon_inode_getfd interface to take an additional parameter with a flag value. The flag value is passed on to get_unused_fd_flags in anticipation for a use with the O_CLOEXEC flag. No actual semantic changes here, the changed callers all pass 0 for now. [akpm@linux-foundation.org: KVM fix] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:27 -07:00
Akinobu Mita	6e2c10a12a	binfmt_misc: use simple_read_from_buffer() Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:27 -07:00
Jon Tollefson	f4a67cceee	fs: check for statfs overflow Adds a check for an overflow in the filesystem size so if someone is checking with statfs() on a 16G blocksize hugetlbfs in a 32bit binary that it will report back EOVERFLOW instead of a size of 0. Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:19 -07:00
Andi Kleen	a137e1cc6d	hugetlbfs: per mount huge page sizes Add the ability to configure the hugetlb hstate used on a per mount basis. - Add a new pagesize= option to the hugetlbfs mount that allows setting the page size - This option causes the mount code to find the hstate corresponding to the specified size, and sets up a pointer to the hstate in the mount's superblock. - Change the hstate accessors to use this information rather than the global_hstate they were using (requires a slight change in mm/memory.c so we don't NULL deref in the error-unmap path -- see comments). [np: take hstate out of hugetlbfs inode and vma->vm_private_data] Acked-by: Adam Litke <agl@us.ibm.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:17 -07:00
Andi Kleen	a551643895	hugetlb: modular state for hugetlb page size The goal of this patchset is to support multiple hugetlb page sizes. This is achieved by introducing a new struct hstate structure, which encapsulates the important hugetlb state and constants (eg. huge page size, number of huge pages currently allocated, etc). The hstate structure is then passed around the code which requires these fields, they will do the right thing regardless of the exact hstate they are operating on. This patch adds the hstate structure, with a single global instance of it (default_hstate), and does the basic work of converting hugetlb to use the hstate. Future patches will add more hstate structures to allow for different hugetlbfs mounts to have different page sizes. [akpm@linux-foundation.org: coding-style fixes] Acked-by: Adam Litke <agl@us.ibm.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:17 -07:00
Eric Dumazet	a47a126ad5	vmallocinfo: add NUMA information Christoph recently added /proc/vmallocinfo file to get information about vmalloc allocations. This patch adds NUMA specific information, giving number of pages allocated on each memory node. This should help to check that vmalloc() is able to respect NUMA policies. Example of output on a four nodes machine (one cpu per node) 1) network hash tables are evenly spreaded on four nodes (OK) (Same point for inodes and dentries hash tables) 2) iptables tables (x_tables) are correctly allocated on each cpu node (OK). 3) sys_swapon() allocates its memory from one node only. 4) each loaded module is using memory on one node. Sysadmins could tune their setup to change points 3) and 4) if necessary. grep "pages=" /proc/vmallocinfo 0xffffc20000000000-0xffffc20000201000 2101248 alloc_large_system_hash+0x204/0x2c0 pages=512 vmalloc N0=128 N1=128 N2=128 N3=128 0xffffc20000201000-0xffffc20000302000 1052672 alloc_large_system_hash+0x204/0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64 0xffffc2000031a000-0xffffc2000031d000 12288 alloc_large_system_hash+0x204/0x2c0 pages=2 vmalloc N1=1 N2=1 0xffffc2000031f000-0xffffc2000032b000 49152 cramfs_uncompress_init+0x2e/0x80 pages=11 vmalloc N0=3 N1=3 N2=2 N3=3 0xffffc2000033e000-0xffffc20000341000 12288 sys_swapon+0x640/0xac0 pages=2 vmalloc N0=2 0xffffc20000341000-0xffffc20000344000 12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N0=2 0xffffc20000344000-0xffffc20000347000 12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N1=2 0xffffc20000347000-0xffffc2000034a000 12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N2=2 0xffffc2000034a000-0xffffc2000034d000 12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N3=2 0xffffc20004381000-0xffffc20004402000 528384 alloc_large_system_hash+0x204/0x2c0 pages=128 vmalloc N0=32 N1=32 N2=32 N3=32 0xffffc20004402000-0xffffc20004803000 4198400 alloc_large_system_hash+0x204/0x2c0 pages=1024 vmalloc vpages N0=256 N1=256 N2=256 N3=256 0xffffc20004803000-0xffffc20004904000 1052672 alloc_large_system_hash+0x204/0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64 0xffffc20004904000-0xffffc20004bec000 `3047424` sys_swapon+0x640/0xac0 pages=743 vmalloc vpages N0=743 0xffffffffa0000000-0xffffffffa000f000 61440 sys_init_module+0xc27/0x1d00 pages=14 vmalloc N1=14 0xffffffffa000f000-0xffffffffa0014000 20480 sys_init_module+0xc27/0x1d00 pages=4 vmalloc N0=4 0xffffffffa0014000-0xffffffffa0017000 12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N0=2 0xffffffffa0017000-0xffffffffa0022000 45056 sys_init_module+0xc27/0x1d00 pages=10 vmalloc N1=10 0xffffffffa0022000-0xffffffffa0028000 24576 sys_init_module+0xc27/0x1d00 pages=5 vmalloc N3=5 0xffffffffa0028000-0xffffffffa0050000 163840 sys_init_module+0xc27/0x1d00 pages=39 vmalloc N1=39 0xffffffffa0050000-0xffffffffa0052000 8192 sys_init_module+0xc27/0x1d00 pages=1 vmalloc N1=1 0xffffffffa0052000-0xffffffffa0056000 16384 sys_init_module+0xc27/0x1d00 pages=3 vmalloc N1=3 0xffffffffa0056000-0xffffffffa0081000 176128 sys_init_module+0xc27/0x1d00 pages=42 vmalloc N3=42 0xffffffffa0081000-0xffffffffa00ae000 184320 sys_init_module+0xc27/0x1d00 pages=44 vmalloc N3=44 0xffffffffa00ae000-0xffffffffa00b1000 12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2 0xffffffffa00b1000-0xffffffffa00b9000 32768 sys_init_module+0xc27/0x1d00 pages=7 vmalloc N0=7 0xffffffffa00b9000-0xffffffffa00c4000 45056 sys_init_module+0xc27/0x1d00 pages=10 vmalloc N3=10 0xffffffffa00c6000-0xffffffffa00e0000 106496 sys_init_module+0xc27/0x1d00 pages=25 vmalloc N2=25 0xffffffffa00e0000-0xffffffffa00f1000 69632 sys_init_module+0xc27/0x1d00 pages=16 vmalloc N2=16 0xffffffffa00f1000-0xffffffffa00f4000 12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2 0xffffffffa00f4000-0xffffffffa00f7000 12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2 [akpm@linux-foundation.org: fix comment] Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:17 -07:00
Pavel Machek	cce7708158	SYNC_FILE_RANGE_WRITE may and will block. Document that. [akpm@linux-foundation.org: fix comment text] Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:17 -07:00
Mel Gorman	04f2cbe356	hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed After patch 2 in this series, a process that successfully calls mmap() for a MAP_PRIVATE mapping will be guaranteed to successfully fault until a process calls fork(). At that point, the next write fault from the parent could fail due to COW if the child still has a reference. We only reserve pages for the parent but a copy must be made to avoid leaking data from the parent to the child after fork(). Reserves could be taken for both parent and child at fork time to guarantee faults but if the mapping is large it is highly likely we will not have sufficient pages for the reservation, and it is common to fork only to exec() immediatly after. A failure here would be very undesirable. Note that the current behaviour of mainline with MAP_PRIVATE pages is pretty bad. The following situation is allowed to occur today. 1. Process calls mmap(MAP_PRIVATE) 2. Process calls mlock() to fault all pages and makes sure it succeeds 3. Process forks() 4. Process writes to MAP_PRIVATE mapping while child still exists 5. If the COW fails at this point, the process gets SIGKILLed even though it had taken care to ensure the pages existed This patch improves the situation by guaranteeing the reliability of the process that successfully calls mmap(). When the parent performs COW, it will try to satisfy the allocation without using reserves. If that fails the parent will steal the page leaving any children without a page. Faults from the child after that point will result in failure. If the child COW happens first, an attempt will be made to allocate the page without reserves and the child will get SIGKILLed on failure. To summarise the new behaviour: 1. If the original mapper performs COW on a private mapping with multiple references, it will attempt to allocate a hugepage from the pool or the buddy allocator without using the existing reserves. On fail, VMAs mapping the same area are traversed and the page being COW'd is unmapped where found. It will then steal the original page as the last mapper in the normal way. 2. The VMAs the pages were unmapped from are flagged to note that pages with data no longer exist. Future no-page faults on those VMAs will terminate the process as otherwise it would appear that data was corrupted. A warning is printed to the console that this situation occured. 2. If the child performs COW first, it will attempt to satisfy the COW from the pool if there are enough pages or via the buddy allocator if overcommit is allowed and the buddy allocator can satisfy the request. If it fails, the child will be killed. If the pool is large enough, existing applications will not notice that the reserves were a factor. Existing applications depending on the no-reserves been set are unlikely to exist as for much of the history of hugetlbfs, pages were prefaulted at mmap(), allocating the pages at that point or failing the mmap(). [npiggin@suse.de: fix CONFIG_HUGETLB=n build] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:16 -07:00
Mel Gorman	a1e78772d7	hugetlb: reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings until fork() This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings in a similar manner to the reservations taken for MAP_SHARED mappings. The reserve count is accounted both globally and on a per-VMA basis for private mappings. This guarantees that a process that successfully calls mmap() will successfully fault all pages in the future unless fork() is called. The characteristics of private mappings of hugetlbfs files behaviour after this patch are; 1. The process calling mmap() is guaranteed to succeed all future faults until it forks(). 2. On fork(), the parent may die due to SIGKILL on writes to the private mapping if enough pages are not available for the COW. For reasonably reliable behaviour in the face of a small huge page pool, children of hugepage-aware processes should not reference the mappings; such as might occur when fork()ing to exec(). 3. On fork(), the child VMAs inherit no reserves. Reads on pages already faulted by the parent will succeed. Successful writes will depend on enough huge pages being free in the pool. 4. Quotas of the hugetlbfs mount are checked at reserve time for the mapper and at fault time otherwise. Before this patch, all reads or writes in the child potentially needs page allocations that can later lead to the death of the parent. This applies to reads and writes of uninstantiated pages as well as COW. After the patch it is only a write to an instantiated page that causes problems. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:16 -07:00
Kentaro Makita	da3bbdd463	fix soft lock up at NFS mount via per-SB LRU-list of unused dentries [Summary] Split LRU-list of unused dentries to one per superblock to avoid soft lock up during NFS mounts and remounting of any filesystem. Previously I posted here: http://lkml.org/lkml/2008/3/5/590 [Descriptions] - background dentry_unused is a list of dentries which are not referenced. dentry_unused grows up when references on directories or files are released. This list can be very long if there is huge free memory. - the problem When shrink_dcache_sb() is called, it scans all dentry_unused linearly under spin_lock(), and if dentry->d_sb is differnt from given superblock, scan next dentry. This scan costs very much if there are many entries, and very ineffective if there are many superblocks. IOW, When we need to shrink unused dentries on one dentry, but scans unused dentries on all superblocks in the system. For example, we scan 500 dentries to unmount a filesystem, but scans 1,000,000 or more unused dentries on other superblocks. In our case , At mounting NFS, shrink_dcache_sb() is called to shrink unused dentries on NFS, but scans 100,000,000 unused dentries on superblocks in the system such as local ext3 filesystems. I hear NFS mounting took 1 min on some system in use. : NFS uses virtual filesystem in rpc layer, so NFS is affected by this problem. 100,000,000 is possible number on large systems. Per-superblock LRU of unused dentried can reduce the cost in reasonable manner. - How to fix I found this problem is solved by David Chinner's "Per-superblock unused dentry LRU lists V3"(1), so I rebase it and add some fix to reclaim with fairness, which is in Andrew Morton's comments(2). 1) http://lkml.org/lkml/2006/5/25/318 2) http://lkml.org/lkml/2006/5/25/320 Split LRU-list of unused dentries to each superblocks. Then, NFS mounting will check dentries under a superblock instead of all. But this spliting will break LRU of dentry-unused. So, I've attempted to make reclaim unused dentrins with fairness by calculate number of dentries to scan on this sb based on following way number of dentries to scan on this sb = count * (number of dentries on this sb / number of dentries in the machine) - ToDo - I have to measuring performance number and do stress tests. - When unmount occurs during prune_dcache(), scanning on same superblock, It is unable to reach next superblock because it is gone away. We restart scannig superblock from first one, it causes unfairness of reclaim unused dentries on first superblock. But I think this happens very rarely. - Test Results Result on 6GB boxes with excessive unused dentries. Without patch: $ cat /proc/sys/fs/dentry-state 10181835 10180203 45 0 0 0 # mount -t nfs 10.124.60.70:/work/kernel-src nfs real 0m1.830s user 0m0.001s sys 0m1.653s With this patch: $ cat /proc/sys/fs/dentry-state 10236610 10234751 45 0 0 0 # mount -t nfs 10.124.60.70:/work/kernel-src nfs real 0m0.106s user 0m0.002s sys 0m0.032s [akpm@linux-foundation.org: fix comments] Signed-off-by: Kentaro Makita <k-makita@np.css.fujitsu.com> Cc: Neil Brown <neilb@suse.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: David Chinner <dgc@sgi.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:15 -07:00
Jan Beulich	42b7772812	mm: remove double indirection on tlb parameter to free_pgd_range() & Co The double indirection here is not needed anywhere and hence (at least) confusing. Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Acked-by: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:15 -07:00
Adrian Bunk	c748e1340e	mm/vmstat.c: proper externs This patch adds proper extern declarations for five variables in include/linux/vmstat.h Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:14 -07:00
Li Zefan	ec05e868ac	ext4: improve ext4_fill_flex_info() a bit - use kzalloc() instead of kmalloc() + memset() - improve a printk info Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-24 12:49:59 -04:00
Shirish Pargaonkar	ef571cadd5	[CIFS] Fix warnings from checkpatch Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-07-24 15:56:05 +00:00
Shirish Pargaonkar	b1910ad622	[CIFS] Fix improper endian conversion of ACL subauth field In mode_to_acl when converting a Unix mode to a Windows ACL the subauth fields of the SID in the ACL were translated incorrectly on bigendian architectures Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-07-24 14:53:20 +00:00
Shirish Pargaonkar	76c510ad2e	[CIFS] Fix possible double free if search immediately after search rewind fails Signed-off-by: Shirish Pargaonkar <shirishp@us.ibm.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-07-24 14:48:33 +00:00
Steve French	99b1f5b2f6	[CIFS] remove checkpatch warning Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-07-24 02:37:45 +00:00
Alexey Dobriyan	f984c7b982	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Steven French <sfrench@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-07-24 02:34:24 +00:00
Harvey Harrison	5ca33c6ac3	cifs: assorted endian annotations fs/cifs/cifssmb.c:3917:13: warning: incorrect type in assignment (different base types) fs/cifs/cifssmb.c:3917:13: expected bool [unsigned] [usertype] is_unicode fs/cifs/cifssmb.c:3917:13: got restricted __le16 The comment explains why __force is used here. fs/cifs/connect.c:458:16: warning: cast to restricted __be32 fs/cifs/connect.c:458:16: warning: cast to restricted __be32 fs/cifs/connect.c:458:16: warning: cast to restricted __be32 fs/cifs/connect.c:458:16: warning: cast to restricted __be32 fs/cifs/connect.c:458:16: warning: cast to restricted __be32 fs/cifs/connect.c:458:16: warning: cast to restricted __be32 Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-07-24 01:14:41 +00:00
Jeff Layton	8efdbde647	[CIFS] break ATTR_SIZE changes out into their own function Move the code that handles ATTR_SIZE changes to its own function. This makes for a smaller function and reduces the level of indentation. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-07-23 21:28:12 +00:00
Jeff Layton	09e50d55a9	lockdep: annotate cifs in-kernel sockets Put CIFS sockets in their own class to avoid some lockdep warnings. CIFS sockets are not exposed to user-space, and so are not subject to the same deadlock scenarios. A similar change was made a couple of years ago for RPC sockets in commit `ed07536ed6`. This patch should prevent lockdep false-positives like this one: ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.18-98.el5.jtltest.38.bz456320.1debug #1 ------------------------------------------------------- test5/2483 is trying to acquire lock: (sk_lock-AF_INET){--..}, at: [<ffffffff800270d2>] tcp_sendmsg+0x1c/0xb2f but task is already holding lock: (&inode->i_alloc_sem){--..}, at: [<ffffffff8002e454>] notify_change+0xf5/0x2e0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&inode->i_alloc_sem){--..}: [<ffffffff800a817c>] __lock_acquire+0x9a9/0xadf [<ffffffff800a8a72>] lock_acquire+0x55/0x70 [<ffffffff8002e454>] notify_change+0xf5/0x2e0 [<ffffffff800a4e36>] down_write+0x3c/0x68 [<ffffffff8002e454>] notify_change+0xf5/0x2e0 [<ffffffff800e358d>] do_truncate+0x50/0x6b [<ffffffff8005197c>] get_write_access+0x40/0x46 [<ffffffff80012cf1>] may_open+0x1d3/0x22e [<ffffffff8001bc81>] open_namei+0x2c6/0x6dd [<ffffffff800289c6>] do_filp_open+0x1c/0x38 [<ffffffff800683ef>] _spin_unlock+0x17/0x20 [<ffffffff800167a7>] get_unused_fd+0xf9/0x107 [<ffffffff8001a704>] do_sys_open+0x44/0xbe [<ffffffff80060116>] system_call+0x7e/0x83 [<ffffffffffffffff>] 0xffffffffffffffff -> #2 (&sysfs_inode_imutex_key){--..}: [<ffffffff800a817c>] __lock_acquire+0x9a9/0xadf [<ffffffff8010f6df>] create_dir+0x26/0x1d7 [<ffffffff800a8a72>] lock_acquire+0x55/0x70 [<ffffffff8010f6df>] create_dir+0x26/0x1d7 [<ffffffff800671c0>] mutex_lock_nested+0x104/0x29c [<ffffffff800a819d>] __lock_acquire+0x9ca/0xadf [<ffffffff8010f6df>] create_dir+0x26/0x1d7 [<ffffffff8010fc67>] sysfs_create_dir+0x58/0x76 [<ffffffff8015144c>] kobject_add+0xdb/0x198 [<ffffffff801be765>] class_device_add+0xb2/0x465 [<ffffffff8005a6ff>] kobject_get+0x12/0x17 [<ffffffff80225265>] register_netdevice+0x270/0x33e [<ffffffff8022538c>] register_netdev+0x59/0x67 [<ffffffff80464d40>] net_olddevs_init+0xb/0xac [<ffffffff80448a79>] init+0x1f9/0x2fc [<ffffffff80068885>] _spin_unlock_irq+0x24/0x27 [<ffffffff80067f86>] trace_hardirqs_on_thunk+0x35/0x37 [<ffffffff80061079>] child_rip+0xa/0x11 [<ffffffff80068885>] _spin_unlock_irq+0x24/0x27 [<ffffffff800606a8>] restore_args+0x0/0x30 [<ffffffff80179a59>] acpi_ds_init_one_object+0x0/0x80 [<ffffffff80448880>] init+0x0/0x2fc [<ffffffff8006106f>] child_rip+0x0/0x11 [<ffffffffffffffff>] 0xffffffffffffffff -> #1 (rtnl_mutex){--..}: [<ffffffff800a817c>] __lock_acquire+0x9a9/0xadf [<ffffffff8025acf8>] ip_mc_leave_group+0x23/0xb7 [<ffffffff800a8a72>] lock_acquire+0x55/0x70 [<ffffffff8025acf8>] ip_mc_leave_group+0x23/0xb7 [<ffffffff800671c0>] mutex_lock_nested+0x104/0x29c [<ffffffff8025acf8>] ip_mc_leave_group+0x23/0xb7 [<ffffffff802451b0>] do_ip_setsockopt+0x6d1/0x9bf [<ffffffff800a575e>] lock_release_holdtime+0x27/0x48 [<ffffffff800a575e>] lock_release_holdtime+0x27/0x48 [<ffffffff8006a85e>] do_page_fault+0x503/0x835 [<ffffffff8012cbf6>] socket_has_perm+0x5b/0x68 [<ffffffff80245556>] ip_setsockopt+0x22/0x78 [<ffffffff8021c973>] sys_setsockopt+0x91/0xb7 [<ffffffff800602a6>] tracesys+0xd5/0xdf [<ffffffffffffffff>] 0xffffffffffffffff -> #0 (sk_lock-AF_INET){--..}: [<ffffffff800a5037>] print_stack_trace+0x59/0x68 [<ffffffff800a8092>] __lock_acquire+0x8bf/0xadf [<ffffffff800a8a72>] lock_acquire+0x55/0x70 [<ffffffff800270d2>] tcp_sendmsg+0x1c/0xb2f [<ffffffff80035466>] lock_sock+0xd4/0xe4 [<ffffffff80096e91>] _local_bh_enable+0xcb/0xe0 [<ffffffff800606a8>] restore_args+0x0/0x30 [<ffffffff800270d2>] tcp_sendmsg+0x1c/0xb2f [<ffffffff80057540>] sock_sendmsg+0xf3/0x110 [<ffffffff800a2bb6>] autoremove_wake_function+0x0/0x2e [<ffffffff800a10e4>] kernel_text_address+0x1a/0x26 [<ffffffff8006f4e2>] dump_trace+0x211/0x23a [<ffffffff800a6d3d>] find_usage_backwards+0x5f/0x88 [<ffffffff8840221a>] MD5Final+0xaf/0xc2 [cifs] [<ffffffff884032ec>] cifs_calculate_signature+0x55/0x69 [cifs] [<ffffffff8021d891>] kernel_sendmsg+0x35/0x47 [<ffffffff883ff38e>] smb_send+0xa3/0x151 [cifs] [<ffffffff883ff5de>] SendReceive+0x1a2/0x448 [cifs] [<ffffffff800a812f>] __lock_acquire+0x95c/0xadf [<ffffffff883e758a>] CIFSSMBSetEOF+0x20d/0x25b [cifs] [<ffffffff883fa430>] cifs_set_file_size+0x110/0x3b7 [cifs] [<ffffffff883faa89>] cifs_setattr+0x3b2/0x6f6 [cifs] [<ffffffff8002e454>] notify_change+0xf5/0x2e0 [<ffffffff8002e4a4>] notify_change+0x145/0x2e0 [<ffffffff800e358d>] do_truncate+0x50/0x6b [<ffffffff8005197c>] get_write_access+0x40/0x46 [<ffffffff80012cf1>] may_open+0x1d3/0x22e [<ffffffff8001bc81>] open_namei+0x2c6/0x6dd [<ffffffff800289c6>] do_filp_open+0x1c/0x38 [<ffffffff800683ef>] _spin_unlock+0x17/0x20 [<ffffffff800167a7>] get_unused_fd+0xf9/0x107 [<ffffffff8001a704>] do_sys_open+0x44/0xbe [<ffffffff800602a6>] tracesys+0xd5/0xdf [<ffffffffffffffff>] 0xffffffffffffffff other info that might help us debug this: 2 locks held by test5/2483: #0: (&inode->i_mutex){--..}, at: [<ffffffff800e3582>] do_truncate+0x45/0x6b #1: (&inode->i_alloc_sem){--..}, at: [<ffffffff8002e454>] notify_change+0xf5/0x2e0 stack backtrace: Call Trace: [<ffffffff800a6a7b>] print_circular_bug_tail+0x65/0x6e [<ffffffff800a5037>] print_stack_trace+0x59/0x68 [<ffffffff800a8092>] __lock_acquire+0x8bf/0xadf [<ffffffff800a8a72>] lock_acquire+0x55/0x70 [<ffffffff800270d2>] tcp_sendmsg+0x1c/0xb2f [<ffffffff80035466>] lock_sock+0xd4/0xe4 [<ffffffff80096e91>] _local_bh_enable+0xcb/0xe0 [<ffffffff800606a8>] restore_args+0x0/0x30 [<ffffffff800270d2>] tcp_sendmsg+0x1c/0xb2f [<ffffffff80057540>] sock_sendmsg+0xf3/0x110 [<ffffffff800a2bb6>] autoremove_wake_function+0x0/0x2e [<ffffffff800a10e4>] kernel_text_address+0x1a/0x26 [<ffffffff8006f4e2>] dump_trace+0x211/0x23a [<ffffffff800a6d3d>] find_usage_backwards+0x5f/0x88 [<ffffffff8840221a>] :cifs:MD5Final+0xaf/0xc2 [<ffffffff884032ec>] :cifs:cifs_calculate_signature+0x55/0x69 [<ffffffff8021d891>] kernel_sendmsg+0x35/0x47 [<ffffffff883ff38e>] :cifs:smb_send+0xa3/0x151 [<ffffffff883ff5de>] :cifs:SendReceive+0x1a2/0x448 [<ffffffff800a812f>] __lock_acquire+0x95c/0xadf [<ffffffff883e758a>] :cifs:CIFSSMBSetEOF+0x20d/0x25b [<ffffffff883fa430>] :cifs:cifs_set_file_size+0x110/0x3b7 [<ffffffff883faa89>] :cifs:cifs_setattr+0x3b2/0x6f6 [<ffffffff8002e454>] notify_change+0xf5/0x2e0 [<ffffffff8002e4a4>] notify_change+0x145/0x2e0 [<ffffffff800e358d>] do_truncate+0x50/0x6b [<ffffffff8005197c>] get_write_access+0x40/0x46 [<ffffffff80012cf1>] may_open+0x1d3/0x22e [<ffffffff8001bc81>] open_namei+0x2c6/0x6dd [<ffffffff800289c6>] do_filp_open+0x1c/0x38 [<ffffffff800683ef>] _spin_unlock+0x17/0x20 [<ffffffff800167a7>] get_unused_fd+0xf9/0x107 [<ffffffff8001a704>] do_sys_open+0x44/0xbe [<ffffffff800602a6>] tracesys+0xd5/0xdf Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-07-23 18:25:38 +00:00
Linus Torvalds	c010b2f76c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (82 commits) ipw2200: Call netif_*_queue() interfaces properly. netxen: Needs to include linux/vmalloc.h [netdrvr] atl1d: fix !CONFIG_PM build r6040: rework init_one error handling r6040: bump release number to 0.18 r6040: handle RX fifo full and no descriptor interrupts r6040: change the default waiting time r6040: use definitions for magic values in descriptor status r6040: completely rework the RX path r6040: call napi_disable when puting down the interface and set lp->dev accordingly. mv643xx_eth: fix NETPOLL build r6040: rework the RX buffers allocation routine r6040: fix scheduling while atomic in r6040_tx_timeout r6040: fix null pointer access and tx timeouts r6040: prefix all functions with r6040 rndis_host: support WM6 devices as modems at91_ether: use netstats in net_device structure sfc: Create one RX queue and interrupt per CPU package by default sfc: Use a separate workqueue for resets sfc: I2C adapter initialisation fixes ...	2008-07-22 19:09:51 -07:00
Adrian Bunk	8086cd451f	netns: make get_proc_net() static get_proc_net() can now become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-22 14:19:19 -07:00
Linus Torvalds	53baaaa968	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (79 commits) arm: bus_id -> dev_name() and dev_set_name() conversions sparc64: fix up bus_id changes in sparc core code 3c59x: handle pci_name() being const MTD: handle pci_name() being const HP iLO driver sysdev: Convert the x86 mce tolerant sysdev attribute to generic attribute sysdev: Add utility functions for simple int/ulong variable sysdev attributes sysdev: Pass the attribute to the low level sysdev show/store function driver core: Suppress sysfs warnings for device_rename(). kobject: Transmit return value of call_usermodehelper() to caller sysfs-rules.txt: reword API stability statement debugfs: Implement debugfs_remove_recursive() HOWTO: change email addresses of James in HOWTO always enable FW_LOADER unless EMBEDDED=y uio-howto.tmpl: use unique output names uio-howto.tmpl: use standard copyright/legal markings sysfs: don't call notify_change sysdev: fix debugging statements in registration code. kobject: should use kobject_put() in kset-example kobject: reorder kobject to save space on 64 bit builds ...	2008-07-22 13:13:47 -07:00
Alexey Dobriyan	ee1e6ab605	proc: fix /proc/*/pagemap some more struct pagemap_walk was placed on stack, some hooks are initialized, the rest (->pgd_entry, ->pud_entry, ->pte_entry) are valid but junk. Reported-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Vegard Nossum" <vegard.nossum@gmail.com> Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-22 09:59:41 -07:00
John Reiser	6519108746	execve filename: document and export via auxiliary vector The Linux kernel puts the filename argument of execve() into the new address space. Many developers are surprised to learn this. Those who know and could use it, object "But it's not documented." Those who want to use it dislike the expression (char *)(1+ strlen(env[-1+ n_env]) + env[-1+ n_env]) because it requires locating the last original environment variable, and assumes that the filename follows the characters. This patch documents the insertion of the filename, and makes it easier to find by adding a new tag AT_EXECFN in the ElfXX_auxv_t; see <elf.h>. In many cases readlink("/proc/self/exe",) gives the same answer. But if all the original pages get unmapped, then the kernel erases the symlink for /proc/self/exe. This can happen when a program decompressor does a good job of cleaning up after uncompressing directly to memory, so that the address space of the target program looks the same as if compression had never happened. One example is http://upx.sourceforge.net . One notable use of the underlying concept (what path containED the executable) is glibc expanding $ORIGIN in DT_RUNPATH. In practice for the near term, it may be a good idea for user-mode code to use both /proc/self/exe and AT_EXECFN as fall-back methods for each other. /proc/self/exe can fail due to unmapping, AT_EXECFN can fail because it won't be present on non-new systems. The auxvec or {AT_EXECFN}.d_val also can get overwritten, although in nearly all cases this would be the result of a bug. The runtime cost is one NEW_AUX_ENT using two words of stack space. The underlying value is maintained already as bprm->exec; setup_arg_pages() in fs/exec.c slides it for stack_shift, etc. Signed-off-by: John Reiser <jreiser@BitWagon.com> Cc: Roland McGrath <roland@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-22 09:59:40 -07:00
Jan Beulich	04e1e0ccca	[CIFS] Fix compiler warning on 64-bit Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-07-22 13:04:18 +00:00
Cornelia Huck	36ce6dad6e	driver core: Suppress sysfs warnings for device_rename(). driver core: Suppress sysfs warnings for device_rename(). Renaming network devices to an already existing name is not something we want sysfs to print a scary warning for, since the callers can deal with this correctly. So let's introduce sysfs_create_link_nowarn() which gets rid of the common warning. Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-07-21 21:55:01 -07:00
Haavard Skinnemoen	9505e63756	debugfs: Implement debugfs_remove_recursive() debugfs_remove_recursive() will remove a dentry and all its children. Drivers can use this to zap their whole debugfs tree so that they don't need to keep track of every single debugfs dentry they created. It may fail to remove the whole tree in certain cases: sh-3.2# rmmod atmel-mci < /sys/kernel/debug/mmc0/ios/clock mmc0: card b368 removed atmel_mci atmel_mci.0: Lost dma0chan1, falling back to PIO sh-3.2# ls /sys/kernel/debug/mmc0/ ios But I'm not sure if that case can be handled in any sane manner. Signed-off-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com> Cc: Pierre Ossman <drzeus-list@drzeus.cx> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-07-21 21:54:59 -07:00
Miklos Szeredi	93265d13ea	sysfs: don't call notify_change sysfs_chmod_file() calls notify_change() to change the permission bits on a sysfs file. Replace with explicit call to sysfs_setattr() and fsnotify_change(). This is equivalent, except that security_inode_setattr() is not called. This function is called by drivers, so the security checks do not make any sense. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-07-21 21:54:57 -07:00
Kay Sievers	aab0de2451	driver core: remove KOBJ_NAME_LEN define Kobjects do not have a limit in name size since a while, so stop pretending that they do. Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-07-21 21:54:52 -07:00
Greg Kroah-Hartman	6143b59970	device create: coda: convert device_create to device_create_drvdata device_create() is race-prone, so use the race-free device_create_drvdata() instead as device_create() is going away. Cc: Jan Harkes <jaharkes@cs.cmu.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-07-21 21:54:41 -07:00
Linus Torvalds	14b395e35d	Merge branch 'for-2.6.27' of git://linux-nfs.org/~bfields/linux * 'for-2.6.27' of git://linux-nfs.org/~bfields/linux: (51 commits) nfsd: nfs4xdr.c do-while is not a compound statement nfsd: Use C99 initializers in fs/nfsd/nfs4xdr.c lockd: Pass "struct sockaddr *" to new failover-by-IP function lockd: get host reference in nlmsvc_create_block() instead of callers lockd: minor svclock.c style fixes lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_lock lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_testlock lockd: nlm_release_host() checks for NULL, caller needn't file lock: reorder struct file_lock to save space on 64 bit builds nfsd: take file and mnt write in nfs4_upgrade_open nfsd: document open share bit tracking nfsd: tabulate nfs4 xdr encoding functions nfsd: dprint operation names svcrdma: Change WR context get/put to use the kmem cache svcrdma: Create a kmem cache for the WR contexts svcrdma: Add flush_scheduled_work to module exit function svcrdma: Limit ORD based on client's advertised IRD svcrdma: Remove unused wait q from svcrdma_xprt structure svcrdma: Remove unneeded spin locks from __svc_rdma_free svcrdma: Add dma map count and WARN_ON ...	2008-07-20 21:21:46 -07:00
Linus Torvalds	db6d8c7a40	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (1232 commits) iucv: Fix bad merging. net_sched: Add size table for qdiscs net_sched: Add accessor function for packet length for qdiscs net_sched: Add qdisc_enqueue wrapper highmem: Export totalhigh_pages. ipv6 mcast: Omit redundant address family checks in ip6_mc_source(). net: Use standard structures for generic socket address structures. ipv6 netns: Make several "global" sysctl variables namespace aware. netns: Use net_eq() to compare net-namespaces for optimization. ipv6: remove unused macros from net/ipv6.h ipv6: remove unused parameter from ip6_ra_control tcp: fix kernel panic with listening_get_next tcp: Remove redundant checks when setting eff_sacks tcp: options clean up tcp: Fix MD5 signatures for non-linear skbs sctp: Update sctp global memory limit allocations. sctp: remove unnecessary byteshifting, calculate directly in big-endian sctp: Allow only 1 listening socket with SO_REUSEADDR sctp: Do not leak memory on multiple listen() calls sctp: Support ipv6only AF_INET6 sockets. ...	2008-07-20 17:43:29 -07:00
Linus Torvalds	f7df406dce	Merge branch 'configfs-fixup-ptr-error' of git://oss.oracle.com/git/jlbec/linux-2.6 * 'configfs-fixup-ptr-error' of git://oss.oracle.com/git/jlbec/linux-2.6: configfs: Allow ->make_item() and ->make_group() to return detailed errors. Revert "configfs: Allow ->make_item() and ->make_group() to return detailed errors."	2008-07-20 17:17:52 -07:00
Alan Cox	a352def21a	tty: Ldisc revamp Move the line disciplines towards a conventional ->ops arrangement. For the moment the actual 'tty_ldisc' struct in the tty is kept as part of the tty struct but this can then be changed if it turns out that when it all settles down we want to refcount ldiscs separately to the tty. Pull the ldisc code out of /proc and put it with our ldisc code. Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-20 17:12:34 -07:00
David S. Miller	407d819cf0	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/holtmann/bluetooth-2.6	2008-07-19 00:30:39 -07:00
Harvey Harrison	5108b27651	nfsd: nfs4xdr.c do-while is not a compound statement The WRITEMEM macro produces sparse warnings of the form: fs/nfsd/nfs4xdr.c:2668:2: warning: do-while statement is not a compound statement Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Cc: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-07-18 15:18:35 -04:00
J. Bruce Fields	ad1060c89c	nfsd: Use C99 initializers in fs/nfsd/nfs4xdr.c Thanks to problem report and original patch from Harvey Harrison. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Cc: Harvey Harrison <harvey.harrison@gmail.com> Cc: Benny Halevy <bhalevy@panasas.com>	2008-07-18 15:04:58 -04:00
Pavel Emelyanov	b6fcbdb4f2	proc: consolidate per-net single-release callers They are symmetrical to single_open ones :) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:07:44 -07:00
Pavel Emelyanov	de05c557b2	proc: consolidate per-net single_open callers There are already 7 of them - time to kill some duplicate code. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:07:21 -07:00
David S. Miller	49997d7515	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6 Conflicts: Documentation/powerpc/booting-without-of.txt drivers/atm/Makefile drivers/net/fs_enet/fs_enet-main.c drivers/pci/pci-acpi.c net/8021q/vlan.c net/iucv/iucv.c	2008-07-18 02:39:39 -07:00
Joel Becker	a6795e9ebb	configfs: Allow ->make_item() and ->make_group() to return detailed errors. The configfs operations ->make_item() and ->make_group() currently return a new item/group. A return of NULL signifies an error. Because of this, -ENOMEM is the only return code bubbled up the stack. Multiple folks have requested the ability to return specific error codes when these operations fail. This patch adds that ability by changing the ->make_item/group() ops to return ERR_PTR() values. These errors are bubbled up appropriately. NULL returns are changed to -ENOMEM for compatibility. Also updated are the in-kernel users of configfs. This is a rework of reverted commit `11c3b79218`. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2008-07-17 15:21:29 -07:00
Joel Becker	f89ab8619e	Revert "configfs: Allow ->make_item() and ->make_group() to return detailed errors." This reverts commit `11c3b79218`. The code will move to PTR_ERR(). Signed-off-by: Joel Becker <joel.becker@oracle.com>	2008-07-17 14:53:48 -07:00
Aneesh Kumar K.V	12219aea6b	ext4: Cleanup the block reservation code path The truncate patch should not use the i_allocated_meta_blocks value. So add seperate functions to be used in the truncate and alloc path. We also need to release the meta-data block that we reserved for the blocks that we are truncating. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-17 16:12:08 -04:00
Theodore Ts'o	34071da71a	ext4: don't assume extents can't cross block groups when truncating With the FLEX_BG layout, there is no reason why extents can't cross block groups, so make the truncate code reserve enough credits so we don't BUG if we come across such an extent. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-01 21:59:19 -04:00
Theodore Ts'o	bc965ab3f2	ext4: Fix lack of credits BUG() when deleting a badly fragmented inode The extents codepath for ext4_truncate() requests journal transaction credits in very small chunks, requesting only what is needed. This means there may not be enough credits left on the transaction handle after ext4_truncate() returns and then when ext4_delete_inode() tries finish up its work, it may not have enough transaction credits, causing a BUG() oops in the jbd2 core. Also, reserve an extra 2 blocks when starting an ext4_delete_inode() since we need to update the inode bitmap, as well as update the orphaned inode linked list. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-02 21:10:38 -04:00
Theodore Ts'o	0123c93998	ext4: Fix ext4_ext_journal_restart() The ext4_ext_journal_restart() is a convenience function which checks to see if the requested number of credits is present, and if so it closes the current transaction and attaches the current handle to the new transaction. Unfortunately, it wasn't proprely checking the return value from ext4_journal_extend(), so it was starting a new transaction when one was not necessary, and returning an error when all that was necessary was to restart the handle with a new transaction. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-01 20:57:54 -04:00
Eric Sandeen	d5a0d4f732	ext4: fix ext4_da_write_begin error path ext4_da_write_begin needs to call journal_stop before returning, if the page allocation fails. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Acked-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-02 18:51:06 -04:00
Hidehiro Kawai	e9e34f4e8f	jbd2: don't abort if flushing file data failed In ordered mode, the current jbd2 aborts the journal if a file data buffer has an error. But this behavior is unintended, and we found that it has been adopted accidentally. This patch undoes it and just calls printk() instead of aborting the journal. Unlike a similar patch for ext3/jbd, file data buffers are written via generic_writepages(). But we also need to set AS_EIO into their mappings because wait_on_page_writeback_range() clears AS_EIO before a user process sees it. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-31 22:26:04 -04:00
Hidehiro Kawai	9c83a923c6	ext4: don't read inode block if the buffer has a write error A transient I/O error can corrupt inode data. Here is the scenario: (1) update inode_A at the block_B (2) pdflush writes out new inode_A to the filesystem, but it results in write I/O error, at this point, BH_Uptodate flag of the buffer for block_B is cleared and BH_Write_EIO is set (3) create new inode_C which located at block_B, and __ext4_get_inode_loc() tries to read on-disk block_B because the buffer is not uptodate (4) if it can read on-disk block_B successfully, inode_A is overwritten by old data This patch makes __ext4_get_inode_loc() not read the inode block if the buffer has BH_Write_EIO flag. In this case, the buffer should have the latest information, so setting the uptodate flag to the buffer (this avoids WARN_ON_ONCE() in mark_buffer_dirty().) According to this change, we would need to test BH_Write_EIO flag for the error checking. Currently nobody checks write I/O errors on metadata buffers, but it will be done in other patches I'm working on. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: sugita <yumiko.sugita.yf@hitachi.com> Cc: Satoshi OSHIMA <satoshi.oshima.fk@hitachi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Jan Kara <jack@ucw.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-26 16:39:26 -04:00
Aneesh Kumar K.V	6be2ded1d7	ext4: Don't allow lg prealloc list to be grow large. Currently, the locality group prealloc list is freed only when there is a block allocation failure. This can result in large number of entries in the preallocation list making ext4_mb_use_preallocated() expensive. To fix this, we convert the locality group prealloc list to a hash list. The hash index is the order of number of blocks in the prealloc space with a max order of 9. When adding prealloc space to the list we make sure total entries for each order does not exceed 8. If it is more than 8 we discard few entries and make sure the we have only <= 5 entries. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-23 14:14:05 -04:00
Aneesh Kumar K.V	1320cbcf77	ext4: Convert the usage of NR_CPUS to nr_cpu_ids. NR_CPUS can be really large. We should be using nr_cpu_ids instead. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-23 14:09:26 -04:00
Aneesh Kumar K.V	ce89f46cb8	ext4: Improve error handling in mballoc Don't call BUG_ON on file system failures. Instead use ext4_error and also handle the continue case properly. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-23 14:09:29 -04:00
Eric Sandeen	b5f10eed81	ext4: lock block groups when initializing I noticed when filling a 1T filesystem with 4 threads using the fs_mark benchmark: fs_mark -d /mnt/test -D 256 -n 100000 -t 4 -s 20480 -F -S 0 that I occasionally got checksum mismatch errors: EXT4-fs error (device sdb): ext4_init_inode_bitmap: Checksum bad for group 6935 etc. I'd reliably get 4-5 of them during the run. It appears that the problem is likely a race to init the bg's when the uninit_bg feature is enabled. With the patch below, which adds sb_bgl_locking around initialization, I was able to complete several runs with no errors or warnings. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-08-02 21:21:08 -04:00
Eric Sandeen	e29d1cde63	ext4: sync up block and inode bitmap reading functions ext4_read_block_bitmap and read_inode_bitmap do essentially the same thing, and yet they are structured quite differently. I came across this difference while looking at doing bg locking during bg initialization. This patch: * removes unnecessary casts in the error messages * renames read_inode_bitmap to ext4_read_inode_bitmap * and more substantially, restructures the inode bitmap reading function to be more like the block bitmap counterpart. The change to the inode bitmap reader simplifies the locking to be applied in the next patch. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-08-02 21:21:02 -04:00
Theodore Ts'o	8a266467b8	ext4: Allow read/only mounts with corrupted block group checksums If the block group checksums are corrupted, still allow the mount to succeed, so e2fsck can have a chance to try to fix things up. Add code in the remount r/w path to make sure the block group checksums are valid before allowing the filesystem to be remounted read/write. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-26 14:34:21 -04:00
Aneesh Kumar K.V	d03856bd5e	ext4: Fix data corruption when writing to prealloc area Inserting an extent can cause a new entry in the already existing index block. That doesn't increase the depth of the instead. Instead it adds a new leaf block. Now with the new leaf block the path information corresponding to the logical block should be fetched from the new block. The old path will be pointing to the old leaf block. We need to recalucate the path information on extent insert even if depth doesn't change. Without this change, the extent merge after converting an unwritten extent to initialized extent takes the wrong extent and cause data corruption. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-08-02 18:51:32 -04:00
Linus Torvalds	5b664cb235	Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: [PATCH] ocfs2: fix oops in mmap_truncate testing configfs: call drop_link() to cleanup after create_link() failure configfs: Allow ->make_item() and ->make_group() to return detailed errors. configfs: Fix failing mkdir() making racing rmdir() fail configfs: Fix deadlock with racing rmdir() and rename() configfs: Make configfs_new_dirent() return error code instead of NULL configfs: Protect configfs_dirent s_links list mutations configfs: Introduce configfs_dirent_lock ocfs2: Don't snprintf() without a format. ocfs2: Fix CONFIG_OCFS2_DEBUG_FS #ifdefs ocfs2/net: Silence build warnings on sparc64 ocfs2: Handle error during journal load ocfs2: Silence an error message in ocfs2_file_aio_read() ocfs2: use simple_read_from_buffer() ocfs2: fix printk format warnings with OCFS2_FS_STATS=n [PATCH 2/2] ocfs2: Instrument fs cluster locks [PATCH 1/2] ocfs2: Add CONFIG_OCFS2_FS_STATS config option	2008-07-17 10:55:51 -07:00
Coly Li	c0420ad2ca	[PATCH] ocfs2: fix oops in mmap_truncate testing This patch fixes a mmap_truncate bug which was found by ocfs2 test suite. In an ocfs2 cluster more than 1 node, run program mmap_truncate, which races mmap writes and truncates from multiple processes. While the test is running, a stat from another node forces writeout, causing an oops in ocfs2_get_block() because it sees a buffer to write which isn't allocated. This patch fixed the bug by clear dirty and uptodate bits in buffer, leave the buffer unmapped and return. Fix is suggested by Mark Fasheh, and I code up the patch. Signed-off-by: Coly Li <coyli@suse.de> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-07-16 16:13:04 -07:00
Linus Torvalds	9c1be0c471	Merge branch 'for_linus' of git://git.infradead.org/~dedekind/ubifs-2.6 * 'for_linus' of git://git.infradead.org/~dedekind/ubifs-2.6: UBIFS: include to compilation UBIFS: add new flash file system UBIFS: add brief documentation MAINTAINERS: add UBIFS section do_mounts: allow UBI root device name VFS: export sync_sb_inodes VFS: move inode_lock into sync_sb_inodes	2008-07-16 15:02:57 -07:00
Linus Torvalds	8df1b049bc	Merge git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (82 commits) NFSv4: Remove BKL from the nfsv4 state recovery SUNRPC: Remove the BKL from the callback functions NFS: Remove BKL from the readdir code NFS: Remove BKL from the symlink code NFS: Remove BKL from the sillydelete operations NFS: Remove the BKL from the rename, rmdir and unlink operations NFS: Remove BKL from NFS lookup code NFS: Remove the BKL from nfs_link() NFS: Remove the BKL from the inode creation operations NFS: Remove BKL usage from open() NFS: Remove BKL usage from the write path NFS: Remove the BKL from the permission checking code NFS: Remove attribute update related BKL references NFS: Remove BKL requirement from attribute updates NFS: Protect inode->i_nlink updates using inode->i_lock nfs: set correct fl_len in nlmclnt_test() SUNRPC: Support registering IPv6 interfaces with local rpcbind daemon SUNRPC: Refactor rpcb_register to make rpcbindv4 support easier SUNRPC: None of rpcb_create's callers wants a privileged source port SUNRPC: Introduce a specific rpcb_create for contacting localhost ...	2008-07-16 14:49:49 -07:00
Randy Dunlap	3c3622dcb6	Fix compile issues in fs/compat_ioctl.c when CONFIG_BLOCK is disabled Fix fs/compat_ioctl.c to handle CONFIG_BLOCK=n, CONFIG_SCSI=n to avoid build errors: In file included from include/scsi/scsi.h:12, from fs/compat_ioctl.c:71: include/scsi/scsi_cmnd.h:27:25: warning: "BLK_MAX_CDB" is not defined include/scsi/scsi_cmnd.h:28:3: error: #error MAX_COMMAND_SIZE can not be bigger than BLK_MAX_CDB In file included from include/scsi/scsi.h:12, from fs/compat_ioctl.c:71: include/scsi/scsi_cmnd.h: In function 'scsi_bidi_cmnd': include/scsi/scsi_cmnd.h:182: error: implicit declaration of function 'blk_bidi_rq' include/scsi/scsi_cmnd.h:183: error: dereferencing pointer to incomplete type include/scsi/scsi_cmnd.h: In function 'scsi_in': include/scsi/scsi_cmnd.h:189: error: dereferencing pointer to incomplete type Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-16 11:46:16 -07:00
Trond Myklebust	cadc723cc1	Merge branch 'bkl-removal' into next	2008-07-15 18:34:58 -04:00
Trond Myklebust	e89e896d31	Merge branch 'devel' into next Conflicts: fs/nfs/file.c Fix up the conflict with Jon Corbet's bkl-removal tree	2008-07-15 18:34:16 -04:00
Trond Myklebust	f839c4c199	NFSv4: Remove BKL from the nfsv4 state recovery Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:57 -04:00
Trond Myklebust	a86dc496b7	SUNRPC: Remove the BKL from the callback functions Push it into those callback functions that actually need it. Note that all the NFS operations use their own locking, so don't need the BKL. Ditto for the rpcbind client. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:57 -04:00
Trond Myklebust	c3cc8c019c	NFS: Remove BKL from the readdir code Page accesses are serialised using the page locks, whereas all attribute updates are serialised using the inode->i_lock. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:56 -04:00
Trond Myklebust	76566991f9	NFS: Remove BKL from the symlink code Page cache accesses are serialised using page locks, whereas attribute updates are serialised using inode->i_lock. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:56 -04:00
Trond Myklebust	52e2e8d37e	NFS: Remove BKL from the sillydelete operations Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:55 -04:00
Trond Myklebust	bd9bb454b7	NFS: Remove the BKL from the rename, rmdir and unlink operations Attribute updates are safe, and dentry operations are protected using VFS level locks. Defer removing the BKL from sillyrename until a separate patch. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:55 -04:00
Trond Myklebust	fc0f684c21	NFS: Remove BKL from NFS lookup code All dentry-related operations are already BKL-safe, since they are protected by the VFS locking. No extra locks should be needed in the NFS code. In the case of nfs_revalidate_inode(), we're only doing an attribute update (protected by the inode->i_lock). In the case of nfs_lookup(), we're instantiating a new dentry, so there should be no contention possible until after we call d_materialise_unique. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:54 -04:00
Trond Myklebust	fc81af535e	NFS: Remove the BKL from nfs_link() Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:54 -04:00
Trond Myklebust	f1e2eda235	NFS: Remove the BKL from the inode creation operations nfs_instantiate() does not require the BKL, neither do the attribute updates or the RPC code. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:53 -04:00
Trond Myklebust	bba67e0e3f	NFS: Remove BKL usage from open() All the NFSv4 stateful operations are already protected by other locks (in particular by the rpc_sequence locks. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:53 -04:00
Trond Myklebust	b6a2e569e2	NFS: Remove BKL usage from the write path Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:52 -04:00
Trond Myklebust	4d80f2ecd5	NFS: Remove the BKL from the permission checking code Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:52 -04:00
Trond Myklebust	fa6dc9dc59	NFS: Remove attribute update related BKL references Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:51 -04:00
Trond Myklebust	a3d01454bc	NFS: Remove BKL requirement from attribute updates The main problem is dealing with inode->i_size: we need to set the inode->i_lock on all attribute updates, and so vmtruncate won't cut it. Make an NFS-private version of vmtruncate that has the necessary locking semantics. The result should be that the following inode attribute updates are protected by inode->i_lock nfsi->cache_validity nfsi->read_cache_jiffies nfsi->attrtimeo nfsi->attrtimeo_timestamp nfsi->change_attr nfsi->last_updated nfsi->cache_change_attribute nfsi->access_cache nfsi->access_cache_entry_lru nfsi->access_cache_inode_lru nfsi->acl_access nfsi->acl_default nfsi->nfs_page_tree nfsi->ncommit nfsi->npages nfsi->open_files nfsi->silly_list nfsi->acl nfsi->open_states inode->i_size inode->i_atime inode->i_mtime inode->i_ctime inode->i_nlink inode->i_uid inode->i_gid The following is protected by dir->i_mutex nfsi->cookieverf Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:51 -04:00
Trond Myklebust	1b83d70703	NFS: Protect inode->i_nlink updates using inode->i_lock Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:50 -04:00
Felix Blyakher	d67d1c7bf9	nfs: set correct fl_len in nlmclnt_test() fcntl(F_GETLK) on an nfs client incorrectly returns the values for the conflicting lock. fl_len value is always 1. If the conflicting lock is (0, 4095) the F_GETLK request for (1024, 10) returns (0, 1), which doesn't even cover the requested range, and is quite confusing. The fix is trivial, set fl_end from the fl_end value recieved from the nfs server. Signed-off-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:08:59 -04:00
Chuck Lever	367c8c7bd9	lockd: Pass "struct sockaddr *" to new failover-by-IP function Pass a more generic socket address type to nlmsvc_unlock_all_by_ip() to allow for future support of IPv6. Also provide additional sanity checking in failover_unlock_ip() when constructing the server's IP address. As an added bonus, provide clean kerneldoc comments on related NLM interfaces which were recently added. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-07-15 16:11:29 -04:00
Ingo Molnar	1a781a777b	Merge branch 'generic-ipi' into generic-ipi-for-linus Conflicts: arch/powerpc/Kconfig arch/s390/kernel/time.c arch/x86/kernel/apic_32.c arch/x86/kernel/cpu/perfctr-watchdog.c arch/x86/kernel/i8259_64.c arch/x86/kernel/ldt.c arch/x86/kernel/nmi_64.c arch/x86/kernel/smpboot.c arch/x86/xen/smp.c include/asm-x86/hw_irq_32.h include/asm-x86/hw_irq_64.h include/asm-x86/mach-default/irq_vectors.h include/asm-x86/mach-voyager/irq_vectors.h include/asm-x86/smp.h kernel/Makefile Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-15 21:55:59 +02:00
J. Bruce Fields	560de0e659	lockd: get host reference in nlmsvc_create_block() instead of callers It may not be obvious (till you look at the definition of nlm_alloc_call()) that a function like nlmsvc_create_block() should consume a reference on success or failure, so I find it clearer if it takes the reference it needs itself. And both callers already do this immediately before the call anyway. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-07-15 15:40:25 -04:00
J. Bruce Fields	6d7bbbbacc	lockd: minor svclock.c style fixes Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-07-15 15:28:43 -04:00
Jeff Layton	6cde4de807	lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_lock nlmsvc_lock calls nlmsvc_lookup_host to find a nlm_host struct. The callers of this function, however, call nlmsvc_retrieve_args or nlm4svc_retrieve_args, which also return a nlm_host struct. Change nlmsvc_lock to take a host arg instead of calling nlmsvc_lookup_host itself and change the callers to pass a pointer to the nlm_host they've already found. Since nlmsvc_testlock() now just uses the caller's reference, we no longer need to get or release it. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-07-15 14:53:33 -04:00
Jeff Layton	8f920d5e29	lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_testlock nlmsvc_testlock calls nlmsvc_lookup_host to find a nlm_host struct. The callers of this functions, however, call nlmsvc_retrieve_args or nlm4svc_retrieve_args, which also return a nlm_host struct. Change nlmsvc_testlock to take a host arg instead of calling nlmsvc_lookup_host itself and change the callers to pass a pointer to the nlm_host they've already found. We take a reference to host in the place where nlmsvc_testlock() previous did a new lookup, so the reference counting is unchanged from before. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-07-15 14:26:52 -04:00
Linus Torvalds	e4e0fadcd9	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6: jfs: remove DIRENTSIZ JFS: diAlloc() should return -EIO rather than EIO JFS: skip bad iput() call in error path JFS: switch to seq_files JFS: 0 is not valid errno value so return NULL from jfs_lookup	2008-07-15 11:03:19 -07:00
Linus Torvalds	38c46578ff	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: [GFS2] Fix GFS2's use of do_div() in its quota calculations [GFS2] Remove unused declaration [GFS2] Remove support for unused and pointless flag [GFS2] Replace rgrp "recent list" with mru list [GFS2] Allow local DF locks when holding a cached EX glock [GFS2] Fix delayed demote race [GFS2] don't call permission() [GFS2] Fix module building [GFS2] Glock documentation [GFS2] Remove all_list from lock_dlm [GFS2] Remove obsolete conversion deadlock avoidance code [GFS2] Remove remote lock dropping code [GFS2] kernel panic mounting volume [GFS2] Revise readpage locking [GFS2] Fix ordering of args for list_add [GFS2] trivial sparse lock annotations [GFS2] No lock_nolock [GFS2] Fix ordering bug in lock_dlm [GFS2] Clean up the glock core	2008-07-15 10:38:46 -07:00
Jeff Layton	b0e92aae15	lockd: nlm_release_host() checks for NULL, caller needn't No need to check for a NULL argument twice. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-07-15 12:35:20 -04:00
Linus Torvalds	8d2567a620	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (61 commits) ext4: Documention update for new ordered mode and delayed allocation ext4: do not set extents feature from the kernel ext4: Don't allow nonextenst mount option for large filesystem ext4: Enable delalloc by default. ext4: delayed allocation i_blocks fix for stat ext4: fix delalloc i_disksize early update issue ext4: Handle page without buffers in ext4_*_writepage() ext4: Add ordered mode support for delalloc ext4: Invert lock ordering of page_lock and transaction start in delalloc mm: Add range_cont mode for writeback ext4: delayed allocation ENOSPC handling percpu_counter: new function percpu_counter_sum_and_set ext4: Add delayed allocation support in data=writeback mode vfs: add hooks for ext4's delayed allocation support jbd2: Remove data=ordered mode support using jbd buffer heads ext4: Use new framework for data=ordered mode in JBD2 jbd2: Implement data=ordered mode handling via inodes vfs: export filemap_fdatawrite_range() ext4: Fix lock inversion in ext4_ext_truncate() ext4: Invert the locking order of page_lock and transaction start ...	2008-07-15 08:36:38 -07:00
Artem Bityutskiy	0d7eff873c	UBIFS: include to compilation Add UBIFS to Makefile and Kbuild. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>	2008-07-15 17:35:24 +03:00
Artem Bityutskiy	1e51764a3c	UBIFS: add new flash file system This is a new flash file system. See http://www.linux-mtd.infradead.org/doc/ubifs.html Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>	2008-07-15 17:35:15 +03:00
Linus Torvalds	d1794f2c5b	Merge branch 'bkl-removal' of git://git.lwn.net/linux-2.6 * 'bkl-removal' of git://git.lwn.net/linux-2.6: (146 commits) IB/umad: BKL is not needed for ib_umad_open() IB/uverbs: BKL is not needed for ib_uverbs_open() bf561-coreb: BKL unneeded for open() Call fasync() functions without the BKL snd/PCM: fasync BKL pushdown ipmi: fasync BKL pushdown ecryptfs: fasync BKL pushdown Bluetooth VHCI: fasync BKL pushdown tty_io: fasync BKL pushdown tun: fasync BKL pushdown i2o: fasync BKL pushdown mpt: fasync BKL pushdown Remove BKL from remote_llseek v2 Make FAT users happier by not deadlocking x86-mce: BKL pushdown vmwatchdog: BKL pushdown vmcp: BKL pushdown via-pmu: BKL pushdown uml-random: BKL pushdown uml-mmapper: BKL pushdown ...	2008-07-14 14:48:31 -07:00
Jonathan Corbet	2fceef397f	Merge commit 'v2.6.26' into bkl-removal	2008-07-14 15:29:34 -06:00
Louis Rilling	e752065175	configfs: call drop_link() to cleanup after create_link() failure When allow_link() succeeds but create_link() fails, the subsystem is not informed of the failure. This patch fixes this by calling drop_link() on create_link() failures. Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2008-07-14 13:57:16 -07:00
Joel Becker	11c3b79218	configfs: Allow ->make_item() and ->make_group() to return detailed errors. The configfs operations ->make_item() and ->make_group() currently return a new item/group. A return of NULL signifies an error. Because of this, -ENOMEM is the only return code bubbled up the stack. Multiple folks have requested the ability to return specific error codes when these operations fail. This patch adds that ability by changing the ->make_item/group() ops to return an int. Also updated are the in-kernel users of configfs. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2008-07-14 13:57:16 -07:00
Louis Rilling	6d8344baee	configfs: Fix failing mkdir() making racing rmdir() fail When fixing the rename() vs rmdir() deadlock, we stopped locking default groups' inodes in configfs_detach_prep(), letting racing mkdir() in default groups proceed concurrently. This enables races like below happen, which leads to a failing mkdir() making rmdir() fail, despite the group to remove having no user-created directory under it in the end. process A: process B: /* PWD=A/B */ mkdir("C") make_item("C") attach_group("C") rmdir("A") detach_prep("A") detach_prep("B") error because of "C" return -ENOTEMPTY attach_group("C/D") error (eg -ENOMEM) return -ENOMEM This patch prevents such scenarii by making rmdir() wait as long as detach_prep() fails because a racing mkdir() is in the middle of attach_group(). To achieve this, mkdir() sets a flag CONFIGFS_USET_IN_MKDIR in parent's configfs_dirent before calling attach_group(), and clears the flag once attach_group() is done. detach_prep() fails with -EAGAIN whenever the flag is hit and returns the guilty inode's mutex so that rmdir() can wait on it. Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2008-07-14 13:57:16 -07:00
Louis Rilling	b3e76af874	configfs: Fix deadlock with racing rmdir() and rename() This patch fixes the deadlock between racing sys_rename() and configfs_rmdir(). The idea is to avoid locking i_mutexes of default groups in configfs_detach_prep(), and rely instead on the new configfs_dirent_lock to protect against configfs_dirent's linkage mutations. To ensure that an mkdir() racing with rmdir() will not create new items in a to-be-removed default group, we make configfs_new_dirent() check for the CONFIGFS_USET_DROPPING flag right before linking the new dirent, and return error if the flag is set. This makes racing mkdir()/symlink()/dir_open() fail in places where errors could already happen, resp. in (attach_item()\|attach_group())/create_link()/new_dirent(). configfs_depend() remains safe since it locks all the path from configfs root, and is thus mutually exclusive with rmdir(). An advantage of this is that now detach_groups() unconditionnaly takes the default groups i_mutex, which makes it more consistent with populate_groups(). Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2008-07-14 13:57:16 -07:00
Louis Rilling	107ed40bd0	configfs: Make configfs_new_dirent() return error code instead of NULL This patch makes configfs_new_dirent return negative error code instead of NULL, which will be useful in the next patch to differentiate ENOMEM from ENOENT. Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2008-07-14 13:57:16 -07:00
Louis Rilling	5301a77da2	configfs: Protect configfs_dirent s_links list mutations Symlinks to a config_item are listed under its configfs_dirent s_links, but the list mutations are not protected by any common lock. This patch uses the configfs_dirent_lock spinlock to add the necessary protection. Note: we should also protect the list_empty() test in configfs_detach_prep() but 1/ the lock should not be released immediately because nothing would prevent the list from being filled after a successful list_empty() test, making the problem tricky, 2/ this will be solved by the rmdir() vs rename() deadlock bugfix. Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2008-07-14 13:57:16 -07:00
Louis Rilling	6f61076406	configfs: Introduce configfs_dirent_lock This patch introduces configfs_dirent_lock spinlock to protect configfs_dirent traversals against linkage mutations (add/del/move). This will allow configfs_detach_prep() to avoid locking i_mutexes. Locking rules for configfs_dirent linkage mutations are the same plus the requirement of taking configfs_dirent_lock. For configfs_dirent walking, one can either take appropriate i_mutex as before, or take configfs_dirent_lock. The spinlock could actually be a mutex, but the critical sections are either O(1) or should not be too long (default groups walking in last patch). ChangeLog: - Clarify the comment on configfs_dirent_lock usage - Move sd->s_element init before linking the new dirent - In lseek(), do not release configfs_dirent_lock before the dirent is relinked. Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2008-07-14 13:57:15 -07:00
Joel Becker	fe9f387740	ocfs2: Don't snprintf() without a format. Some system files are per-slot. Their names include the slot number. ocfs2_sprintf_system_inode_name() uses the system inode definitions to fill in the slot number with snprintf(). For global system files, there is no node number, and the name was printed as a format with no arguments. -Wformat-nonliteral and -Wformat-security don't like this. Instead, use a static "%s" format and the name as the argument. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2008-07-14 13:57:15 -07:00
Joel Becker	e407e39783	ocfs2: Fix CONFIG_OCFS2_DEBUG_FS #ifdefs A couple places use OCFS2_DEBUG_FS where they really mean CONFIG_OCFS2_DEBUG_FS. Reported-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2008-07-14 13:57:15 -07:00
Sunil Mushran	461c6a30ec	ocfs2/net: Silence build warnings on sparc64 suseconds_t is type long on most arches except sparc64 where it is type int. This patch silences the following warnings that are generated when building on it. netdebug.c: In function 'nst_seq_show': netdebug.c:152: warning: format '%lu' expects type 'long unsigned int', but argument 13 has type 'suseconds_t' netdebug.c:152: warning: format '%lu' expects type 'long unsigned int', but argument 15 has type 'suseconds_t' netdebug.c:152: warning: format '%lu' expects type 'long unsigned int', but argument 17 has type 'suseconds_t' netdebug.c: In function 'sc_seq_show': netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 19 has type 'suseconds_t' netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 21 has type 'suseconds_t' netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 23 has type 'suseconds_t' netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 25 has type 'suseconds_t' netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 27 has type 'suseconds_t' netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 29 has type 'suseconds_t' Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-07-14 13:57:15 -07:00
Wengang Wang	01af482037	ocfs2: Handle error during journal load This patch ensures the mount fails if the fs is unable to load the journal. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Acked-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-07-14 13:57:15 -07:00
Sunil Mushran	56753bd3b9	ocfs2: Silence an error message in ocfs2_file_aio_read() This patch silences an EINVAL error message in ocfs2_file_aio_read() that is always due to a user error. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-07-14 13:57:15 -07:00
Akinobu Mita	7600c72b75	ocfs2: use simple_read_from_buffer() Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-07-14 13:57:15 -07:00
Randy Dunlap	dd25e55ea1	ocfs2: fix printk format warnings with OCFS2_FS_STATS=n Fix printk format warnings when OCFS2_FS_STATS=n: linux-next-20080528/fs/ocfs2/dlmglue.c: In function 'ocfs2_dlm_seq_show': linux-next-20080528/fs/ocfs2/dlmglue.c:2623: warning: format '%llu' expects type 'long long unsigned int', but argument 3 has type 'int' linux-next-20080528/fs/ocfs2/dlmglue.c:2623: warning: format '%llu' expects type 'long long unsigned int', but argument 4 has type 'int' linux-next-20080528/fs/ocfs2/dlmglue.c:2623: warning: format '%llu' expects type 'long long unsigned int', but argument 7 has type 'int' linux-next-20080528/fs/ocfs2/dlmglue.c:2623: warning: format '%llu' expects type 'long long unsigned int', but argument 8 has type 'int' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-07-14 13:57:14 -07:00
Sunil Mushran	8ddb7b004d	[PATCH 2/2] ocfs2: Instrument fs cluster locks This patch adds code to track the number of times the fs takes various cluster locks as well as the times associated with it. The information is made available to users via debugfs. This patch was originally written by Jan Kara <jack@suse.cz>. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-07-14 13:57:14 -07:00
Sunil Mushran	ce7231e92d	[PATCH 1/2] ocfs2: Add CONFIG_OCFS2_FS_STATS config option This patch adds config option CONFIG_OCFS2_FS_STATS to allow building the fs with instrumentation enabled. An upcoming patch will provide support to instrument cluster locking, which is a crucial overhead in a cluster file system. This config option allows users to avoid the cpu and memory overhead that is involved in gathering such statistics. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-07-14 13:57:14 -07:00
Linus Torvalds	a3da5bf84a	Merge branch 'x86/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (821 commits) x86: make 64bit hpet_set_mapping to use ioremap too, v2 x86: get x86_phys_bits early x86: max_low_pfn_mapped fix #4 x86: change _node_to_cpumask_ptr to return const ptr x86: I/O APIC: remove an IRQ2-mask hack x86: fix numaq_tsc_disable calling x86, e820: remove end_user_pfn x86: max_low_pfn_mapped fix, #3 x86: max_low_pfn_mapped fix, #2 x86: max_low_pfn_mapped fix, #1 x86_64: fix delayed signals x86: remove conflicting nx6325 and nx6125 quirks x86: Recover timer_ack lost in the merge of the NMI watchdog x86: I/O APIC: Never configure IRQ2 x86: L-APIC: Always fully configure IRQ0 x86: L-APIC: Set IRQ0 as edge-triggered x86: merge dwarf2 headers x86: use AS_CFI instead of UNWIND_INFO x86: use ignore macro instead of hash comment x86: use matching CFI_ENDPROC ...	2008-07-14 13:43:24 -07:00
Linus Torvalds	847106ff62	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (25 commits) security: remove register_security hook security: remove dummy module fix security: remove dummy module security: remove unused sb_get_mnt_opts hook LSM/SELinux: show LSM mount options in /proc/mounts SELinux: allow fstype unknown to policy to use xattrs if present security: fix return of void-valued expressions SELinux: use do_each_thread as a proper do/while block SELinux: remove unused and shadowed addrlen variable SELinux: more user friendly unknown handling printk selinux: change handling of invalid classes (Was: Re: 2.6.26-rc5-mm1 selinux whine) SELinux: drop load_mutex in security_load_policy SELinux: fix off by 1 reference of class_to_string in context_struct_compute_av SELinux: open code sidtab lock SELinux: open code load_mutex SELinux: open code policy_rwlock selinux: fix endianness bug in network node address handling selinux: simplify ioctl checking SELinux: enable processes with mac_admin to get the raw inode contexts Security: split proc ptrace checking into read vs. attach ...	2008-07-14 13:36:55 -07:00
Linus Torvalds	dddec01eb8	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: (37 commits) splice: fix generic_file_splice_read() race with page invalidation ramfs: enable splice write drivers/block/pktcdvd.c: avoid useless memset cdrom: revert commit `22a9189` (cdrom: use kmalloced buffers instead of buffers on stack) scsi: sr avoids useless buffer allocation block: blk_rq_map_kern uses the bounce buffers for stack buffers block: add blk_queue_update_dma_pad DAC960: push down BKL pktcdvd: push BKL down into driver paride: push ioctl down into driver block: use get_unaligned_* helpers block: extend queue_flag bitops block: request_module(): use format string Add bvec_merge_data to handle stacked devices and ->merge_bvec() block: integrity flags can't use bit ops on unsigned short cmdfilter: extend default read filter sg: fix odd style (extra parenthesis) introduced by cmd filter patch block: add bounce support to blk_rq_map_user_iov cfq-iosched: get rid of enable_idle being unused warning allow userspace to modify scsi command filter on per device basis ...	2008-07-14 13:15:14 -07:00
Benny Halevy	18c60c0a3b	dlm: fix uninitialized variable for search_rsb_list callers gcc 4.3.0 correctly emits the following warning. search_rsb_list does not r_ret if no dlm_rsb is found and _search_rsb may pass the uninitialized value upstream on the error path when both calls to search_rsb_list return non-zero error. The fix sets r_ret to NULL on search_rsb_list's not-found path. Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: David Teigland <teigland@redhat.com>	2008-07-14 13:56:59 -05:00
Masatake YAMATO	311f6fc77c	dlm: release socket on error It seems that `sock' allocated by sock_create_kern in tcp_connect_to_sock() of dlm/fs/lowcomms.c is not released if dlm_nodeid_to_addr an error. Acked-by: Christine Caulfield <ccaulfie@redhat.com> Signed-off-by: Masatake YAMATO <yamato@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2008-07-14 13:56:59 -05:00
David Teigland	329fc4c372	dlm: fix basts for granted CW waiting PR/CW The fix in commit `3650925893` was addressing the case of a granted PR lock with waiting PR and CW locks. It's a special case that requires forcing a CW bast. However, that forced CW bast was incorrectly applying to a second condition where the granted lock was CW. So, the holder of a CW lock could receive an extraneous CW bast instead of a PR bast. This fix narrows the original special case to what was intended. Signed-off-by: David Teigland <teigland@redhat.com>	2008-07-14 13:56:59 -05:00
Masatake YAMATO	254ae43ab8	dlm: check for null in device_write If `device_write' method is called via "dlm-control", file->private_data is NULL. (See ctl_device_open() in user.c. ) Through proc->flags is read. Signed-off-by: Masatake YAMATO <yamato@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2008-07-14 13:56:59 -05:00
Marcel Holtmann	40be492fe4	[Bluetooth] Export details about authentication requirements With the Simple Pairing support, the authentication requirements are an explicit setting during the bonding process. Track and enforce the requirements and allow higher layers like L2CAP and RFCOMM to increase them if needed. This patch introduces a new IOCTL that allows to query the current authentication requirements. It is also possible to detect Simple Pairing support in the kernel this way. Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2008-07-14 20:13:50 +02:00
Artem Bityutskiy	4ee6afd344	VFS: export sync_sb_inodes This patch exports the 'sync_sb_inodes()' which is needed for UBIFS because it has to force write-back from time to time. Namely, the UBIFS budgeting subsystem forces write-back when its pessimistic callculations show that there is no free space on the media. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2008-07-14 19:10:52 +03:00
Hans Reiser	ae8547b0a9	VFS: move inode_lock into sync_sb_inodes This patch makes 'sync_sb_inodes()' lock 'inode_lock', rather than expect that the caller will do this. This change was previously done by Hans Reiser <reiser@namesys.com> and sat in the -mm tree. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2008-07-14 19:10:52 +03:00
Ingo Molnar	d59fdcf2ac	Merge commit 'v2.6.26' into x86/core	2008-07-14 11:37:46 +02:00
Eric Paris	2069f45784	LSM/SELinux: show LSM mount options in /proc/mounts This patch causes SELinux mount options to show up in /proc/mounts. As with other code in the area seq_put errors are ignored. Other LSM's will not have their mount options displayed until they fill in their own security_sb_show_options() function. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: James Morris <jmorris@namei.org>	2008-07-14 15:02:05 +10:00
Stephen Smalley	006ebb40d3	Security: split proc ptrace checking into read vs. attach Enable security modules to distinguish reading of process state via proc from full ptrace access by renaming ptrace_may_attach to ptrace_may_access and adding a mode argument indicating whether only read access or full attach access is requested. This allows security modules to permit access to reading process state without granting full ptrace access. The base DAC/capability checking remains unchanged. Read access to /proc/pid/mem continues to apply a full ptrace attach check since check_mem_permission() already requires the current task to already be ptracing the target. The other ptrace checks within proc for elements like environ, maps, and fds are changed to pass the read mode instead of attach. In the SELinux case, we model such reading of process state as a reading of a proc file labeled with the target process' label. This enables SELinux policy to permit such reading of process state without permitting control or manipulation of the target process, as there are a number of cases where programs probe for such information via proc but do not need to be able to control the target (e.g. procps, lsof, PolicyKit, ConsoleKit). At present we have to choose between allowing full ptrace in policy (more permissive than required/desired) or breaking functionality (or in some cases just silencing the denials via dontaudit rules but this can hide genuine attacks). This version of the patch incorporates comments from Casey Schaufler (change/replace existing ptrace_may_attach interface, pass access mode), and Chris Wright (provide greater consistency in the checking). Note that like their predecessors __ptrace_may_attach and ptrace_may_attach, the __ptrace_may_access and ptrace_may_access interfaces use different return value conventions from each other (0 or -errno vs. 1 or 0). I retained this difference to avoid any changes to the caller logic but made the difference clearer by changing the latter interface to return a bool rather than an int and by adding a comment about it to ptrace.h for any future callers. Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: James Morris <jmorris@namei.org>	2008-07-14 15:01:47 +10:00
Jeff Layton	536abdb080	cifs: fix wksidarr declaration to be big-endian friendly The current definition of wksidarr works fine on little endian arches (since cpu_to_le32 is a no-op there), but on big-endian arches, it fails to compile with this error: error: braced-group within expression allowed only inside a function The problem is that this static declaration has cpu_to_le32 embedded within it, and that expands into a function macro. We need to use __constant_cpu_to_le32() instead. Signed-off-by: Jeff Layton <jlayton@redhat.com> Cc: Steven French <sfrench@us.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-12 14:33:42 -07:00
Jeff Layton	e911d0cc87	cifs: fix inode leak in cifs_get_inode_info_unix Try this: mount a share with unix extensions create a file on it umount the share You'll get the following message in the ring buffer: VFS: Busy inodes after unmount of cifs. Self-destruct in 5 seconds. Have a nice day... ...the problem is that cifs_get_inode_info_unix is creating and hashing a new inode even when it's going to return error anyway. The first lookup when creating a file returns an error so we end up leaking this inode before we do the actual create. This appears to be a regression caused by commit `0e4bbde94f`. The following patch seems to fix it for me, and fixes a minor formatting nit as well. Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Steven French <sfrench@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-12 14:33:42 -07:00
Ingo Molnar	ae94b8075a	Merge branch 'linus' into x86/core Conflicts: arch/x86/mm/ioremap.c Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-12 07:29:02 +02:00
Eric Sandeen	e4079a11f5	ext4: do not set extents feature from the kernel We've talked for a while about getting rid of any feature- setting from the kernel; this gets rid of the code which would set the INCOMPAT_EXTENTS flag on the first file write when mounted as ext4[dev]. With this patch, if the extents feature is not already set on disk, then mounting as ext4 will fall back to noextents with a warning, and if -o extents is explicitly requested, the mount will fail, also with warning. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	c07651b556	ext4: Don't allow nonextenst mount option for large filesystem The block mapped inode format can address only blocks within 232. This causes a number of issues, the biggest of which is that the block allocator needs to be taught that certain inodes can not utilize block numbers > 232. So until this is fixed, it is simplest to fail mounting of file systems with more than 2**32 blocks if the -o noextents option is given. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	dd919b9822	ext4: Enable delalloc by default. Enable delalloc by default to ensure it gets sufficient testing and because it makes the filesystem much more efficient. Add a nodealalloc option to disable delayed allocation, and update ext4_show_options to show delayed allocation off if it is disabled. If the data=journal mount option is used, disable delayed allocation since the delalloc code doesn't support data=journal yet. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Mingming Cao <cmm@us.ibm.com>	2008-07-11 19:27:31 -04:00
Mingming Cao	3e3398a08d	ext4: delayed allocation i_blocks fix for stat Right now i_blocks is not getting updated until the blocks are actually allocaed on disk. This means with delayed allocation, right after files are copied, "ls -sF" shoes the file as taking 0 blocks on disk. "du" also shows the files taking zero space, which is highly confusing to the user. Since delayed allocation already keeps track of per-inode total number of blocks that are subject to delayed allocation, this patch fix this by using that to adjust the value returned by stat(2). When real block allocation is done, the i_blocks will get updated. Since the reserved blocks for delayed allocation will be decreased, this will be keep value returned by stat(2) consistent. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Mingming Cao	632eaeab1f	ext4: fix delalloc i_disksize early update issue Ext4_da_write_end() used walk_page_buffers() with a callback function of ext4_bh_unmapped_or_delay() to check if it extended the file size without allocating any blocks (since in this case i_disksize needs to be updated). However, this is didn't work proprely because the buffer head has not been marked dirty yet --- this is done later in block_commit_write() --- which caused ext4_bh_unmapped_or_delay() to always return false. In addition, walk_page_buffers() checks all of the buffer heads covering the page, and the only buffer_head that should be checked is the one covering the end of the write. Otherwise, given a 1k blocksize filesystem and a 4k page size, the buffer head covering the first 1k stripe of the file could be unmapped (because it was a sparse file), and the second or third buffer_head covering that page could be mapped, and using walk_page_buffers() would fail in this case since it would stop at the first unmapped buffer_head and return true. The core problem is that walk_page_buffers() was intended to do work in a callback function, and a non-zero return value indicated a failure, which termined the walk of the buffer heads covering the page. It was not intended to be used with a boolean function, such as ext4_bh_unmapped_or_delay(). Add addtional fix from Aneesh to protect i_disksize update rave with truncate. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	f0e6c98593	ext4: Handle page without buffers in ext4__writepage() It can happen that buffers are removed from the page before it gets marked dirty and then is passed to writepage(). In writepage() we just initialize the buffers and check whether they are mapped and non delay. If they are mapped and non delay we write the page. Otherwise we mark them dirty. With this change we don't do block allocation at all in ext4__write_page. writepage() can get called under many condition and with a locking order of journal_start -> lock_page, we should not try to allocate blocks in writepage() which get called after taking page lock. writepage() can get called via shrink_page_list even with a journal handle which was created for doing inode update. For example when doing ext4_da_write_begin we create a journal handle with credit 1 expecting a i_disksize update for the inode. But ext4_da_write_begin can cause shrink_page_list via _grab_page_cache. So having a valid handle via ext4_journal_current_handle is not a guarantee that we can use the handle for block allocation in writepage, since we shouldn't be using credits that had been reserved for other updates. That it could result in we running out of credits when we update inodes. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	cd1aac3292	ext4: Add ordered mode support for delalloc This provides a new ordered mode implementation which gets rid of using buffer heads to enforce the ordering between metadata change with the related data chage. Instead, in the new ordering mode, it keeps track of all of the inodes touched by each transaction on a list, and when that transaction is committed, it flushes all of the dirty pages for those inodes. In addition, the new ordered mode reverses the lock ordering of the page lock and transaction lock, which provides easier support for delayed allocation. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Mingming Cao	61628a3f3a	ext4: Invert lock ordering of page_lock and transaction start in delalloc With the reverse locking, we need to start a transation before taking the page lock, so in ext4_da_writepages() we need to break the write-out into chunks, and restart the journal for each chunck to ensure the write-out fits in a single transaction. Updated patch from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> which fixes delalloc sync hang with journal lock inversion, and address the performance regression issue. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Mingming Cao	d2a1763791	ext4: delayed allocation ENOSPC handling This patch does block reservation for delayed allocation, to avoid ENOSPC later at page flush time. Blocks(data and metadata) are reserved at da_write_begin() time, the freeblocks counter is updated by then, and the number of reserved blocks is store in per inode counter. At the writepage time, the unused reserved meta blocks are returned back. At unlink/truncate time, reserved blocks are properly released. Updated fix from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> to fix the oldallocator block reservation accounting with delalloc, added lock to guard the counters and also fix the reservation for meta blocks. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-14 17:52:37 -04:00
Mingming Cao	e8ced39d5e	percpu_counter: new function percpu_counter_sum_and_set Delayed allocation need to check free blocks at every write time. percpu_counter_read_positive() is not quit accurate. delayed allocation need a more accurate accounting, but using percpu_counter_sum_positive() is frequently is quite expensive. This patch added a new function to update center counter when sum per-cpu counter, to increase the accurate rate for next percpu_counter_read() and require less calling expensive percpu_counter_sum(). Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Alex Tomas	64769240bd	ext4: Add delayed allocation support in data=writeback mode Updated with fixes from Mingming Cao <cmm@us.ibm.com> to unlock and release the page from page cache if the delalloc write_begin failed, and properly handle preallocated blocks. Also added a fix to clear buffer_delay in block_write_full_page() after allocating a delayed buffer. Updated with fixes from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> to update i_disksize properly and to add bmap support for delayed allocation. Updated with a fix from Valerie Clement <valerie.clement@bull.net> to avoid filesystem corruption when the filesystem is mounted with the delalloc option and blocksize < pagesize. Signed-off-by: Alex Tomas <alex@clusterfs.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2008-07-11 19:27:31 -04:00
Alex Tomas	29a814d2ee	vfs: add hooks for ext4's delayed allocation support Export mpage_bio_submit() and __mpage_writepage() for the benefit of ext4's delayed allocation support. Also change __block_write_full_page so that if buffers that have the BH_Delay flag set it will call get_block() to get the physical block allocated, just as in the !BH_Mapped case. Signed-off-by: Alex Tomas <alex@clusterfs.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Jan Kara	87c89c232c	jbd2: Remove data=ordered mode support using jbd buffer heads Signed-off-by: Jan Kara <jack@suse.cz>	2008-07-11 19:27:31 -04:00
Jan Kara	678aaf4814	ext4: Use new framework for data=ordered mode in JBD2 This patch makes ext4 use inode-based implementation of data=ordered mode in JBD2. It allows us to unify some data=ordered and data=writeback paths (especially writepage since we don't have to start a transaction anymore) and remove some buffer walking. Updated fix from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> to fix file system hang due to corrupt jinode values. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Jan Kara	c851ed5401	jbd2: Implement data=ordered mode handling via inodes This patch adds necessary framework into JBD2 to be able to track inodes with each transaction and write-out their dirty data during transaction commit time. This new ordered mode brings all sorts of advantages such as possibility to get rid of journal heads and buffer heads for data buffers in ordered mode, better ordering of writes on transaction commit, simplification of some JBD code, no more anonymous pages when truncate of data being committed happens. Also with this new ordered mode, delayed allocation on ordered mode is much simpler. Signed-off-by: Jan Kara <jack@suse.cz>	2008-07-11 19:27:31 -04:00
Jan Kara	9ddfc3dc75	ext4: Fix lock inversion in ext4_ext_truncate() We cannot call ext4_orphan_add() from under i_data_sem because that causes a lock ordering violation between i_data_sem and and the superblock lock. Updated with Aneesh's locking order fix Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Jan Kara	cf108bca46	ext4: Invert the locking order of page_lock and transaction start This changes are needed to support data=ordered mode handling via inodes. This enables us to get rid of the journal heads and buffer heads for data buffers in the ordered mode. With the changes, during tranasaction commit we writeout the inode pages using the writepages()/writepage(). That implies we take page lock during transaction commit. This can cause a deadlock with the locking order page_lock -> jbd2_journal_start, since the jbd2_journal_start can wait for the journal_commit to happen and the journal_commit now needs to take the page lock. To avoid this dead lock reverse the locking order. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Jan Kara	c7d206b337	vfs: Move mark_inode_dirty() from under page lock in generic_write_end() There's no need to call mark_inode_dirty() under page lock in generic_write_end(). It unnecessarily makes hold time of page lock longer and more importantly it forces locking order of page lock and transaction start for journaling filesystems. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	2e9ee85035	ext4: Use page_mkwrite vma_operations to get mmap write notification. We would like to get notified when we are doing a write on mmap section. This is needed with respect to preallocated area. We split the preallocated area into initialzed extent and uninitialzed extent in the call back. This let us handle ENOSPC better. Otherwise we get ENOSPC in the writepage and that would result in data loss. The changes are also needed to handle ENOSPC when writing to an mmap section of files with holes. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Frederic Bohe	5f21b0e642	ext4: fix online resize with mballoc Update group infos when updating a group's descriptor. Add group infos when adding a group's descriptor. Refresh cache pages used by mb_alloc when changes occur. This will probably need modifications when META_BG resizing will be allowed. Signed-off-by: Frederic Bohe <frederic.bohe@bull.net> Signed-off-by: Mingming Cao <cmm@us.ibm.com>	2008-07-11 19:27:31 -04:00

... 3 4 5 6 7 ...

9770 Commits