OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Zheng Liu	4a3c3a5120	ext4: fix format flag in ext4_ext_binsearch_idx() fix ext_debug format flag in ext4_ext_binsearch_idx(). Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-05-28 17:55:16 -04:00
Zheng Liu	400db9d301	ext4: cleanup in ext4_discard_allocated_blocks() remove 'len' variable in ext4_discard_allocated_blocks() because it is useless. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-05-28 17:53:53 -04:00
Theodore Ts'o	2cde417de0	ext4: return ENOMEM when mounts fail due to lack of memory This is a port of the ext3 commit: `4569cd1b0d` Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-05-28 17:49:54 -04:00
Theodore Ts'o	2716b80284	ext4: remove redundundant "(char ) bh->b_data" casts The b_data field of the buffer_head is already a char , so there's no point casting it to a char *. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-05-28 17:47:52 -04:00
Andreas Dilger	7e936b7372	ext4: disallow hard-linked directory in ext4_lookup A hard-linked directory to its parent can cause the VFS to deadlock, and is a sign of a corrupted file system. So detect this case in ext4_lookup(), before the rmdir() lockup scenario can take place. Signed-off-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2012-05-28 17:02:25 -04:00
Haogang Chen	967ac8af44	ext4: fix potential integer overflow in alloc_flex_gd() In alloc_flex_gd(), when flexbg_size is large, kmalloc size would overflow and flex_gd->groups would point to a buffer smaller than expected, causing OOB accesses when it is used. Note that in ext4_resize_fs(), flexbg_size is calculated using sbi->s_log_groups_per_flex, which is read from the disk and only bounded to [1, 31]. The patch returns NULL for too large flexbg_size. Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Haogang Chen <haogangchen@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2012-05-28 14:21:55 -04:00
Akira Fujita	9d99012ff2	ext4: remove needs_recovery in ext4_mb_init() needs_recovery in ext4_mb_init() is not used, remove it. Signed-off-by: Akira Fujita <a-fujita@rs.jp.ne.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-05-28 14:19:25 -04:00
Eric Sandeen	7e84b62164	ext4: force ro mount if ext4_setup_super() fails If ext4_setup_super() fails i.e. due to a too-high revision, the error is logged in dmesg but the fs is not mounted RO as indicated. Tested by: # mkfs.ext4 -r 4 /dev/sdb6 # mount /dev/sdb6 /mnt/test # dmesg \| grep "too high" [164919.759248] EXT4-fs (sdb6): revision level too high, forcing read-only mode # grep sdb6 /proc/mounts /dev/sdb6 /mnt/test2 ext4 rw,seclabel,relatime,data=ordered 0 0 Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2012-05-28 14:17:25 -04:00
Dan Carpenter	bb3d132a24	ext4: fix potential NULL dereference in ext4_free_inodes_counts() The ext4_get_group_desc() function returns NULL on error, and ext4_free_inodes_count() function dereferences it without checking. There is a check on the next line, but it's too late. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2012-05-28 14:16:57 -04:00
Linus Torvalds	90324cc1b1	avoid iput() from flusher thread -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJPw2J/AAoJECvKgwp+S8Ja5jkP/3uMxkhf8XQpXCI3O1QVfaQr uZFfM8sINqIPDVm1dtFjFj7f8Bw9mhE2KAnnJ1rKT8tQwqq9yAse1QPlhCG1ZqoP +AnMDDXHtx7WmQZXhBvS9b+unpZ7Jr6r6pO5XrmTL2kRL3YJPUhZ2+xbTT5belTB KoAu4WqORZRxfXoC76S7U8K+D4NcAGhAOxCClsIjmY+oocCiCag4FZOyzYIFViqc ghUN/+rLQ3fqGGv2yO7Ylx1gUM7sxIwkZQ/h962jFAtxz9czImr2NmRoMliOaOkS tvcnIf+E3u0n/zIjzFvzhxKgHJPP8PkcPMk60d3jKmFngBkqFTzNUeVTP8md7HrV 4DlXisWr+z7YVyWUCFaNcJLmjiWSwQ8DV/clRLobeBf9EJKan5F1PjFgl6PLJM5F Qr1+LHMNaetdulBwMRTyveZTzYqw9RmDnD9dWMo4mX/kTpvtC4jTPVV7hkRD+Qlv 5vTRR+VXL3Q50yClLf0AQMSKTnH2gBuepM/b+7cShLGfsMln8DtUjmbigv+niL63 BibcCIbIlP2uWGnl37VhsC34AT+RKt3lggrBOpn/7XJMq/wKR7IRP/7V9TfYgaUN NBa+wtnLDa1pZEn/X7izdcQP62PzDtmB+ObvYT0Yb40A4+2ud3qF/lB53c1A1ewF /9c4zxxekjHZnn2oooEa =oLXf -----END PGP SIGNATURE----- Merge tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux Pull writeback tree from Wu Fengguang: "Mainly from Jan Kara to avoid iput() in the flusher threads." * tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: Avoid iput() from flusher thread vfs: Rename end_writeback() to clear_inode() vfs: Move waiting for inode writeback from end_writeback() to evict_inode() writeback: Refactor writeback_single_inode() writeback: Remove wb->list_lock from writeback_single_inode() writeback: Separate inode requeueing after writeback writeback: Move I_DIRTY_PAGES handling writeback: Move requeueing when I_SYNC set to writeback_sb_inodes() writeback: Move clearing of I_SYNC into inode_sync_complete() writeback: initialize global_dirty_limit fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds mm: page-writeback.c: local functions should not be exposed globally	2012-05-28 09:54:45 -07:00
Darrick J. Wong	e93376c20b	ext4/jbd2: add metadata checksumming to the list of supported features Activate the metadata checksumming feature by adding it to ext4 and jbd2's lists of supported features. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-05-27 08:12:42 -04:00
Darrick J. Wong	25ed6e8a54	jbd2: enable journal clients to enable v2 checksumming Add in the necessary code so that journal clients can enable the new journal checksumming features. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-05-27 07:48:56 -04:00
Linus Torvalds	ece78b7df7	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext2, ext3 and quota fixes from Jan Kara: "Interesting bits are: - removal of a special i_mutex locking subclass (I_MUTEX_QUOTA) since quota code does not need i_mutex anymore in any unusual way. - backport (from ext4) of a fix of a checkpointing bug (missing cache flush) that could lead to fs corruption on power failure The rest are just random small fixes & cleanups." * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: ext2: trivial fix to comment for ext2_free_blocks ext2: remove the redundant comment for ext2_export_ops ext3: return 32/64-bit dir name hash according to usage type quota: Get rid of nested I_MUTEX_QUOTA locking subclass quota: Use precomputed value of sb_dqopt in dquot_quota_sync ext2: Remove i_mutex use from ext2_quota_write() reiserfs: Remove i_mutex use from reiserfs_quota_write() ext4: Remove i_mutex use from ext4_quota_write() ext3: Remove i_mutex use from ext3_quota_write() quota: Fix double lock in add_dquot_ref() with CONFIG_QUOTA_DEBUG jbd: Write journal superblock with WRITE_FUA after checkpointing jbd: protect all log tail updates with j_checkpoint_mutex jbd: Split updating of journal superblock and marking journal empty ext2: do not register write_super within VFS ext2: Remove s_dirt handling ext2: write superblock only once on unmount ext3: update documentation with barrier=1 default ext3: remove max_debt in find_group_orlov() jbd: Refine commit writeout logic	2012-05-25 08:14:59 -07:00
Linus Torvalds	644473e9c6	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace enhancements from Eric Biederman: "This is a course correction for the user namespace, so that we can reach an inexpensive, maintainable, and reasonably complete implementation. Highlights: - Config guards make it impossible to enable the user namespace and code that has not been converted to be user namespace safe. - Use of the new kuid_t type ensures the if you somehow get past the config guards the kernel will encounter type errors if you enable user namespaces and attempt to compile in code whose permission checks have not been updated to be user namespace safe. - All uids from child user namespaces are mapped into the initial user namespace before they are processed. Removing the need to add an additional check to see if the user namespace of the compared uids remains the same. - With the user namespaces compiled out the performance is as good or better than it is today. - For most operations absolutely nothing changes performance or operationally with the user namespace enabled. - The worst case performance I could come up with was timing 1 billion cache cold stat operations with the user namespace code enabled. This went from 156s to 164s on my laptop (or 156ns to 164ns per stat operation). - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value. Most uid/gid setting system calls treat these value specially anyway so attempting to use -1 as a uid would likely cause entertaining failures in userspace. - If setuid is called with a uid that can not be mapped setuid fails. I have looked at sendmail, login, ssh and every other program I could think of that would call setuid and they all check for and handle the case where setuid fails. - If stat or a similar system call is called from a context in which we can not map a uid we lie and return overflowuid. The LFS experience suggests not lying and returning an error code might be better, but the historical precedent with uids is different and I can not think of anything that would break by lying about a uid we can't map. - Capabilities are localized to the current user namespace making it safe to give the initial user in a user namespace all capabilities. My git tree covers all of the modifications needed to convert the core kernel and enough changes to make a system bootable to runlevel 1." Fix up trivial conflicts due to nearby independent changes in fs/stat.c * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits) userns: Silence silly gcc warning. cred: use correct cred accessor with regards to rcu read lock userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq userns: Convert cgroup permission checks to use uid_eq userns: Convert tmpfs to use kuid and kgid where appropriate userns: Convert sysfs to use kgid/kuid where appropriate userns: Convert sysctl permission checks to use kuid and kgids. userns: Convert proc to use kuid/kgid where appropriate userns: Convert ext4 to user kuid/kgid where appropriate userns: Convert ext3 to use kuid/kgid where appropriate userns: Convert ext2 to use kuid/kgid where appropriate. userns: Convert devpts to use kuid/kgid where appropriate userns: Convert binary formats to use kuid/kgid where appropriate userns: Add negative depends on entries to avoid building code that is userns unsafe userns: signal remove unnecessary map_cred_ns userns: Teach inode_capable to understand inodes whose uids map to other namespaces. userns: Fail exec for suid and sgid binaries with ids outside our user namespace. userns: Convert stat to return values mapped from kuids and kgids userns: Convert user specfied uids and gids in chown into kuids and kgid userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs ...	2012-05-23 17:42:39 -07:00
Theodore Ts'o	f32aaf2d2b	ext4: enable the 64-bit jbd2 feature based on the 64-bit ext4 feature Previously we were only enabling the 64-bit jbd2 feature if the number of blocks in the file system was greater 232-1. The problem with this is that it makes it harder to test the 64-bit journal code paths with small file systems, since a small test file system would with the 64-bit ext4 feature enable would use a 64-bit file system on-disk data structures, but use a 32-bit journal. This would also cause problems when trying to do an online resize to grow the filesystem above the 232-1 boundary. Fortunately the patch to support online resize for 64-bit file systems hasn't been merged yet, so this problem hasn't arisen in practice. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-05-21 11:42:02 -04:00
Eric W. Biederman	08cefc7ab8	userns: Convert ext4 to user kuid/kgid where appropriate Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-05-15 14:59:27 -07:00
Jan Kara	0b7f7cefae	ext4: Remove i_mutex use from ext4_quota_write() We don't need i_mutex in ext4_quota_write() because writes to quota file are serialized by dqio_mutex anyway. Changes to quota files outside of quota code are forbidded and enforced by NOATIME and IMMUTABLE bits. Signed-off-by: Jan Kara <jack@suse.cz>	2012-05-15 23:34:38 +02:00
Linus Torvalds	26fe575028	vfs: make it possible to access the dentry hash/len as one 64-bit entry This allows comparing hash and len in one operation on 64-bit architectures. Right now only __d_lookup_rcu() takes advantage of this, since that is the case we care most about. The use of anonymous struct/unions hides the alternate 64-bit approach from most users, the exception being a few cases where we initialize a 'struct qstr' with a static initializer. This makes the problematic cases use a new QSTR_INIT() helper function for that (but initializing just the name pointer with a "{ .name = xyzzy }" initializer remains valid, as does just copying another qstr structure). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-05-10 19:54:35 -07:00
Jan Kara	dbd5768f87	vfs: Rename end_writeback() to clear_inode() After we moved inode_sync_wait() from end_writeback() it doesn't make sense to call the function end_writeback() anymore. Rename it to clear_inode() which well says what the function really does - set I_CLEAR flag. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>	2012-05-06 13:43:41 +08:00
Theodore Ts'o	b09de7fa52	ext4: remove unnecessary check in add_dirent_to_buf() None of this function callers ever pass in a NULL inode pointer, so this check is unnecessary, and the else clause is dead code. (This change should make the code coverage people a little happier. :-) Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-04-30 07:40:00 -04:00
Darrick J. Wong	5c359a47e7	ext4: add checksums to the MMP block Compute and verify a checksum for the MMP block. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-04-29 18:47:10 -04:00
Darrick J. Wong	feb0ab32a5	ext4: make block group checksums use metadata_csum algorithm metadata_csum supersedes uninit_bg. Convert the ROCOMPAT uninit_bg flag check to a helper function that covers both, and make the checksum calculation algorithm use either crc16 or the metadata_csum chosen algorithm depending on which flag is set. Print a warning if we try to mount a filesystem with both feature flags set. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-04-29 18:45:10 -04:00
Darrick J. Wong	cc8e94fd12	ext4: Calculate and verify checksums of extended attribute blocks Calculate and verify the checksums of extended attribute blocks. This only applies to separate EA blocks that are pointed to by inode->i_file_acl (i.e. external EA blocks); the checksum lives in the EA header. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-04-29 18:43:10 -04:00
Darrick J. Wong	b0336e8d21	ext4: calculate and verify checksums of directory leaf blocks Calculate and verify the checksums for directory leaf blocks (i.e. blocks that only contain actual directory entries). The checksum lives in what looks to be an unused directory entry with a 0 name_len at the end of the block. This scheme is not used for internal htree nodes because the mechanism in place there only costs one dx_entry, whereas the "empty" directory entry would cost two dx_entries. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-04-29 18:41:10 -04:00
Darrick J. Wong	dbe8944404	ext4: Calculate and verify checksums for htree nodes Calculate and verify the checksum for directory index tree (htree) node blocks. The checksum is stored in the last 4 bytes of the htree block and requires the dx_entry array to stop 1 dx_entry short of the end of the block. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-04-29 18:39:10 -04:00
Darrick J. Wong	7ac5990d5a	ext4: verify and calculate checksums for extent tree blocks Calculate and verify the checksum for each extent tree block. The checksum is located in the space immediately after the last possible ext4_extent in the block. The space is is typically the last 4-8 bytes in the block. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-04-29 18:37:10 -04:00
Darrick J. Wong	fa77dcfafe	ext4: calculate and verify block bitmap checksum Compute and verify the checksum of the block bitmap; this checksum is stored in the block group descriptor. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-04-29 18:35:10 -04:00
Darrick J. Wong	41a246d1ff	ext4: calculate and verify checksums for inode bitmaps Compute and verify the checksum of the inode bitmap; the checkum is stored in the block group descriptor. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-04-29 18:33:10 -04:00
Darrick J. Wong	814525f4df	ext4: calculate and verify inode checksums This patch introduces to ext4 the ability to calculate and verify inode checksums. This requires the use of a new ro compatibility flag and some accompanying e2fsprogs patches to provide the relevant features in tune2fs and e2fsck. The inode generation changes have been integrated into this patch. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-04-29 18:31:10 -04:00
Darrick J. Wong	a9c4731780	ext4: calculate and verify superblock checksum Calculate and verify the superblock checksum. Since the UUID and block group number are embedded in each copy of the superblock, we need only checksum the entire block. Refactor some of the code to eliminate open-coding of the checksum update call. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-04-29 18:29:10 -04:00
Darrick J. Wong	0441984a33	ext4: load the crc32c driver if necessary Obtain a reference to the cryptoapi and crc32c if we mount a filesystem with metadata checksumming enabled. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-04-29 18:27:10 -04:00
Darrick J. Wong	d25425f8e0	ext4: record the checksum algorithm in use in the superblock Record the type of checksum algorithm we're using for metadata in the superblock, in case we ever want/need to change the algorithm. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-04-29 18:25:10 -04:00
Darrick J. Wong	e615391896	ext4: change on-disk layout to support extended metadata checksumming Define flags and change structure definitions to allow checksumming of ext4 metadata. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-04-29 18:23:10 -04:00
Darrick J. Wong	f84891289e	ext4: create a new BH_Verified flag to avoid unnecessary metadata validation Create a new BH_Verified flag to indicate that we've verified all the data in a buffer_head for correctness. This allows us to bypass expensive verification steps when they are not necessary without missing them when they are. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-04-29 18:21:10 -04:00
Linus Torvalds	95f7147274	Ext4 bug fixes for 3.4 These are two low-risk bug fixes for ext4, fixing a compile warning and a potential deadlock. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) iQIcBAABCAAGBQJPlgZ8AAoJENNvdpvBGATwewkP/ioo2U05O4tzmt05+HICw1ZK vh1x6oaO3bUMa21pKBzS60rDc+EDu61E+bjVrsasOmom8DZyOP92SiwaDnIsKn6p JBSNwzIOPmuPflEY3tnOsnOZ1umZcB16uhki1Rk1HE0nRPdKiyKJKZnbSzmUGWUW gJwHbHddxZKTmDrEy4CxfbwwKKVm2SQUO5crLohFst4JsXc1h6muEfkcAZvCfZ68 1PQIkTkJUXArQuTuxzP89r7L8tqHJv4iOz+PT0FlluGWvgJUWIOVvjdJfPuQTmLi UNzvtoQxuxjdZuCK/D16kNTkOEPzOhMlNW1djAntdCQohHIJG0Hd5bFju9bybSLz 838sTCEFxRS7rdBEXiksWsPCVDz/QVnPft0RG9jqXd6dRPFr/XJ1rAeDTjW2vmWw ZO28p99aolA5At02AlSf9IgMIME0gKejnvpRo703UW456BlFIXPK3e/nbtE7Eb5A HcZhvIwncWE4cbq2/AboielPSnyx6Z3SJS0hBIQ2wG40xcL/jxYL7K2/trkUr2KH H3/4RsrSlLDXqHRJ4cVW75zKMgyNvc+60HDlAxE62LqKFR7K93hdlHpnkySy/1St FaIiipH8Tmt+u6tqn6rlR82vRxd/dkLgQMpCWm4Et4THXlvisZkbxaDXrEGx79qg v8eEdmHeJuLcQesm9TrS =Ygid -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bug fixes from Ted Ts'o: "These are two low-risk bug fixes for ext4, fixing a compile warning and a potential deadlock." * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: super.c: unused variable warning without CONFIG_QUOTA jbd2: use GFP_NOFS for blkdev_issue_flush	2012-04-23 19:52:00 -07:00
Eldad Zack	db7e5c668e	super.c: unused variable warning without CONFIG_QUOTA sb info is only checked with quota support. fs/ext4/super.c: In function ‘parse_options’: fs/ext4/super.c:1600:23: warning: unused variable ‘sbi’ [-Wunused-variable] Signed-off-by: Eldad Zack <eldad@fogrefinery.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2012-04-23 21:44:41 -04:00
Linus Torvalds	592fe89806	Ext4 regression fixes for 3.4 This fixes a scalability problem reported by Andi Kleen and Tim Chen; they were quite secretive about the precise nature of their workload, but they later admitted that it only showed up when they were using a large sparse file, so the amount of data I/O that was needed was close to zero. I'm not sure how realistic this is and it's only a regression if you consider changes made since 2.6.39 to be a "regression" vis-a-vis the policy regarding post-merge window bug fixes, but Linus agreed it was worth fixing, so I'm including it in this pull request. This also fixes the journalled quota mount options, which I accidentally broke while I was cleaning up the mount option handling. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) iQIcBAABCAAGBQJPjbs2AAoJENNvdpvBGATw/4UQAOMsxRlzbMnOAmohIBmesJiB nTPX41dNw0QCXstMFjvSyRbUJI0NZPGg6ZDhbtoQ/c42/izZOVNd7gh7QQYHRCt2 Oqh6WS159P04ixcAe8NgCm5B5AV/C5Er/vSOUZENBaBo430vvrZasifsyUgQnx+b PxXlYsfPVzaQVeSxCu/68OeiBRLcLwmKZ6MicOaWt9GCNoCsWzgU+/LNskYuscPI 841yQL6/BE7redU2E9qoEn/xjxx57hJOj2iiIAuqGPmNmRQqq3VtvTqNZHldNBLp Hz5mB3zSZsPg0uwvp+OxrnpP37NQCCn1L64UFIXxqGF47mcCYw7schAoGBtqwGQS neGUfkzG4beKk7kojyDawvnrrvVn4iCMaIkR1ZjzjPOk+QBPagckrS2nmuObbYzJ l4lmHq1v8nOh0clxqJPioNp5/Y13sbpEOY4tAa6sLpzKLKXF330RNuwwrKKHB7zo ZhCvSwVEjmxacgPCPhFJnR3fxtjoXWR8WvJs7H+gZB/QaC8hjEYw6xvvrkw8mAiS RNe3cYdYAz6kOJdtJrJaMzp/1CYdHydf+0WvNYCQ/d1poPr7uU5wQY61hdm2gFxh owsbVAOiEFjZWJHqrRRdcg2irTpINJTS3iRe4g/ltcvYzzRSeOcZNWvkKFspq3i8 EUMHRBxLzPMRa+gU6Unm =jb3f -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 regression fixes from Ted Ts'o: "This fixes a scalability problem reported by Andi Kleen and Tim Chen; they were quite secretive about the precise nature of their workload, but they later admitted that it only showed up when they were using a large sparse file, so the amount of data I/O that was needed was close to zero. I'm not sure how realistic this is and it's only a regression if you consider changes made since 2.6.39 to be a "regression" vis-a-vis the policy regarding post-merge window bug fixes, but Linus agreed it was worth fixing, so I'm including it in this pull request. This also fixes the journalled quota mount options, which I accidentally broke while I was cleaning up the mount option handling." * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix handling of journalled quota options ext4: address scalability issue by removing extent cache statistics	2012-04-17 13:30:34 -07:00
Theodore Ts'o	57f73c2c89	ext4: fix handling of journalled quota options Commit `26092bf5` broke handling of journalled quota mount options by trying to parse argument of every mount option as a number. Fix this by dealing with the quota options before we call match_int(). Thanks to Jan Kara for discovering this regression. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>	2012-04-16 18:55:26 -04:00
Theodore Ts'o	9cd70b347e	ext4: address scalability issue by removing extent cache statistics Andi Kleen and Tim Chen have reported that under certain circumstances the extent cache statistics are causing scalability problems due to cache line bounces. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org	2012-04-16 12:16:20 -04:00
Al Viro	af1584f570	ext4: fix endianness breakage in ext4_split_extent_at() ->ee_len is __le16, so assigning cpu_to_le32() to it is going to do Bad Things(tm) on big-endian hosts... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-04-13 10:12:08 -04:00
Linus Torvalds	6268b325c3	Revert "ext4: don't release page refs in ext4_end_bio()" This reverts commit `b43d17f319`. Dave Jones reports that it causes lockups on his laptop, and his debug output showed a lot of processes hung waiting for page_writeback (or more commonly - processes hung waiting for a lock that was held during that writeback wait). The page_writeback hint made Ted suggest that Dave look at this commit, and Dave verified that reverting it makes his problems go away. Ted says: "That commit fixes a race which is seen when you write into fallocated (and hence uninitialized) disk blocks under very heavy memory pressure. Furthermore, although theoretically it could trigger under normal direct I/O writes, it only seems to trigger if you are issuing a huge number of AIO writes, such that a just-written page can get evicted from memory, and then read back into memory, before the workqueue has a chance to update the extent tree. This race has been around for a little over a year, and no one noticed until two months ago; it only happens under fairly exotic conditions, and in fact even after trying very hard to create a simple repro under lab conditions, we could only reproduce the problem and confirm the fix on production servers running MySQL on very fast PCIe-attached flash devices. Given that Dave was able to hit this problem pretty quickly, if we confirm that this commit is at fault, the only reasonable thing to do is to revert it IMO." Reported-and-tested-by: Dave Jones <davej@redhat.com> Acked-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-03-29 17:00:56 -07:00
Linus Torvalds	71db34fc43	Merge branch 'for-3.4' of git://linux-nfs.org/~bfields/linux Pull nfsd changes from Bruce Fields: Highlights: - Benny Halevy and Tigran Mkrtchyan implemented some more 4.1 features, moving us closer to a complete 4.1 implementation. - Bernd Schubert fixed a long-standing problem with readdir cookies on ext2/3/4. - Jeff Layton performed a long-overdue overhaul of the server reboot recovery code which will allow us to deprecate the current code (a rather unusual user of the vfs), and give us some needed flexibility for further improvements. - Like the client, we now support numeric uid's and gid's in the auth_sys case, allowing easier upgrades from NFSv2/v3 to v4.x. Plus miscellaneous bugfixes and cleanup. Thanks to everyone! There are also some delegation fixes waiting on vfs review that I suppose will have to wait for 3.5. With that done I think we'll finally turn off the "EXPERIMENTAL" dependency for v4 (though that's mostly symbolic as it's been on by default in distro's for a while). And the list of 4.1 todo's should be achievable for 3.5 as well: http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues though we may still want a bit more experience with it before turning it on by default. * 'for-3.4' of git://linux-nfs.org/~bfields/linux: (55 commits) nfsd: only register cld pipe notifier when CONFIG_NFSD_V4 is enabled nfsd4: use auth_unix unconditionally on backchannel nfsd: fix NULL pointer dereference in cld_pipe_downcall nfsd4: memory corruption in numeric_name_to_id() sunrpc: skip portmap calls on sessions backchannel nfsd4: allow numeric idmapping nfsd: don't allow legacy client tracker init for anything but init_net nfsd: add notifier to handle mount/unmount of rpc_pipefs sb nfsd: add the infrastructure to handle the cld upcall nfsd: add a header describing upcall to nfsdcld nfsd: add a per-net-namespace struct for nfsd sunrpc: create nfsd dir in rpc_pipefs nfsd: add nfsd4_client_tracking_ops struct and a way to set it nfsd: convert nfs4_client->cl_cb_flags to a generic flags field NFSD: Fix nfs4_verifier memory alignment NFSD: Fix warnings when NFSD_DEBUG is not defined nfsd: vfs_llseek() with 32 or 64 bit offsets (hashes) nfsd: rename 'int access' to 'int may_flags' in nfsd_open() ext4: return 32/64-bit dir name hash according to usage type fs: add new FMODE flags: FMODE_32bithash and FMODE_64bithash ...	2012-03-29 14:53:25 -07:00
Linus Torvalds	69e1aaddd6	Ext4 commits for 3.3 merge window; mostly cleanups and bug fixes The changes to export dirty_writeback_interval are from Artem's s_dirt cleanup patch series. The same is true of the change to remove the s_dirt helper functions which never got used by anyone in-tree. I've run these changes by Al Viro, and am carrying them so that Artem can more easily fix up the rest of the file systems during the next merge window. (Originally we had hopped to remove the use of s_dirt from ext4 during this merge window, but his patches had some bugs, so I ultimately ended dropping them from the ext4 tree.) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) iQIcBAABCAAGBQJPb39rAAoJENNvdpvBGATwVz8P/3V1NqSsk20VJOLbmEE45GxL GDzQJ6OsFG0UiQk6ISSrSdwxfav/KTCGySsU9UtAoOdPcBwnnsf8S7wc6OggwwuC hBFGwwFzk6YSQaZ58sUxWRGeOJuP/FPem6Id6buC4DQ1KIcznP/hEEgEnh/ir4Ec vrsfexY93TR8BE2Mi23v2epDVLU0B6bY/w9nDqbTXif3xN/gh/ypoHHouuM6Bs2n TyWHOwD15NwfnvRHd8PfDDqQM/D29x3QI0FMrWj9McpwIz4d4cBfhN4LQ/G+yLDY izv5DM10GbinwHPrsOTGVAW3KIdSS9rP3jCJGVuOrJZ9ufGXosvHuIYVhI7J3SBK JhBu6QEsN1IsvlVYpz9q8mqVKaDXQLsz2eaTw+i4yfmyOk1kOX7nIEOxYFF78G+V Of/W1SpIpJQaXvLHRcDj9fDj0fZTciUZA8v7/HOFS+co2dzIl0iZbcfBFp0/56RY sWdQoeRlx1ciVDPR+w2TQO5w3VWQw1gT5aqux0NiPj0XFoiUHScxgNGAYbqENMQw v9chvyDMlorqj0rF/Vey5SssgEDi7MTdYuYTi4YyMqr7pcvOJaO85pf+wH9g2eKW XhW33PhPGuwCJDP5Pg8Y0Z2Hp/Q3DCqhLqhGfTyAs/NG9+hR4wgp3VWb8CUqhA1t C/yzNeOYqScAefCzQx2V =+9zk -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates for 3.4 from Ted Ts'o: "Ext4 commits for 3.3 merge window; mostly cleanups and bug fixes The changes to export dirty_writeback_interval are from Artem's s_dirt cleanup patch series. The same is true of the change to remove the s_dirt helper functions which never got used by anyone in-tree. I've run these changes by Al Viro, and am carrying them so that Artem can more easily fix up the rest of the file systems during the next merge window. (Originally we had hopped to remove the use of s_dirt from ext4 during this merge window, but his patches had some bugs, so I ultimately ended dropping them from the ext4 tree.)" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (66 commits) vfs: remove unused superblock helpers mm: export dirty_writeback_interval ext4: remove useless s_dirt assignment ext4: write superblock only once on unmount ext4: do not mark superblock as dirty unnecessarily ext4: correct ext4_punch_hole return codes ext4: remove restrictive checks for EOFBLOCKS_FL ext4: always set then trimmed blocks count into len ext4: fix trimmed block count accunting ext4: fix start and len arguments handling in ext4_trim_fs() ext4: update s_free_{inodes,blocks}_count during online resize ext4: change some printk() calls to use ext4_msg() instead ext4: avoid output message interleaving in ext4_error_<foo>() ext4: remove trailing newlines from ext4_msg() and ext4_error() messages ext4: add no_printk argument validation, fix fallout ext4: remove redundant "EXT4-fs: " from uses of ext4_msg ext4: give more helpful error message in ext4_ext_rm_leaf() ext4: remove unused code from ext4_ext_map_blocks() ext4: rewrite punch hole to use ext4_ext_remove_space() jbd2: cleanup journal tail after transaction commit ...	2012-03-28 10:02:55 -07:00
Artem Bityutskiy	182f514f88	ext4: remove useless s_dirt assignment Clean-up ext4 a tiny bit by removing useless s_dirt assignment in 'ext4_fill_super()' because a bit later we anyway call 'ext4_setup_super()' which writes the superblock to the media unconditionally. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-21 22:30:06 -04:00
Artem Bityutskiy	a8e25a8324	ext4: write superblock only once on unmount In some rather rare cases it is possible that ext4 may the superblock to the media twice. This patch makes sure this does not happen. This should speed up unmounting in those cases. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-21 22:29:15 -04:00
Artem Bityutskiy	1b8b9750f0	ext4: do not mark superblock as dirty unnecessarily Commit `a0375156ca` cleaned up superblock dirtying handling, but missed one place. This patch does what was intended: if we have the journal, then we update the superblock through the journal rather than doing this directly. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-21 22:28:29 -04:00
Allison Henderson	7335519274	ext4: correct ext4_punch_hole return codes ext4_punch_hole returns -ENOTSUPP but it should be using -EOPNOTSUPP Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com> Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-21 22:23:31 -04:00
Lukas Czerner	afcff5d80a	ext4: remove restrictive checks for EOFBLOCKS_FL We are going to remove the EOFBLOCKS_FL flag in the future, so this is the first part of the removal. We can not remove it entirely just now, since the e2fsck is still checking for it and it might cause headache to some people. Instead, remove the restrictive checks now and the rest later, when the new e2fsck code is out and common enough. This is also needed because punch hole already breaks the EOFBLOCKS_FL semantics, so it might cause the some troubles. So simply remove it. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-21 21:47:55 -04:00
Lukas Czerner	a7967f055a	ext4: always set then trimmed blocks count into len Currently if the range to trim is too small, for example on 1K fs the request to trim the first block, then the 'range->len' is not set reporting wrong number of discarded block to the caller. Fix this by always setting the 'range->len' before we return. Note that when there is a failure (-EINVAL) caller can not depend on 'range->len' being set properly. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-21 21:26:22 -04:00
Lukas Czerner	21e7fd22a5	ext4: fix trimmed block count accunting Currently when there is not enough free blocks in the block group to discard (grp->bb_free < minlen) the 'trimmed' is bumped up anyway with the number of discarded blocks from the previous iteration. Fix this by bumping up 'trimmed' only if the ext4_trim_all_free() was actually run. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-21 21:24:22 -04:00
Lukas Czerner	913eed83ed	ext4: fix start and len arguments handling in ext4_trim_fs() The overflow can happen when we are calling get_group_no_and_offset() which stores the group number in the ext4_grpblk_t type which is actually int. However when the blocknr is big enough the group number might be bigger than ext4_grpblk_t resulting in overflow. This will most likely happen with FITRIM default argument len = ULLONG_MAX. Fix this by using "end" variable instead of "start+len" as it is easier to get right and specifically check that the end is not beyond the end of the file system, so we are sure that the result of get_group_no_and_offset() will not overflow. Otherwise truncate it to the size of the file system. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-21 21:22:22 -04:00
Al Viro	07c0c5d8b8	ext4: initialization of ext4_li_mtx needs to be done earlier Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-20 22:05:02 -04:00
Al Viro	48fde701af	switch open-coded instances of d_make_root() to new helper Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-20 21:29:35 -04:00
Darrick J. Wong	636d7e2e3b	ext4: update s_free_{inodes,blocks}_count during online resize When we're doing an online resize of an ext4 filesystem, we need to update the free inode and block counts in the superblock so that fsck doesn't complain. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-20 15:46:11 -04:00
Theodore Ts'o	92b9781658	ext4: change some printk() calls to use ext4_msg() instead Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-19 23:41:49 -04:00
Joe Perches	d9ee81da93	ext4: avoid output message interleaving in ext4_error_<foo>() Using KERN_CONT means that messages from multiple threads may be interleaved. Avoid this by using a single printk call in ext4_error_inode and ext4_error_file. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-19 23:15:43 -04:00
Theodore Ts'o	1084f252e3	ext4: remove trailing newlines from ext4_msg() and ext4_error() messages The functions ext4_msg() and ext4_error() already tack on a trailing newline, so remove the unnecessary extra newline. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-19 23:13:43 -04:00
Joe Perches	ace36ad431	ext4: add no_printk argument validation, fix fallout Add argument validation to debug functions. Use ##__VA_ARGS__. Fix format and argument mismatches. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-19 23:11:43 -04:00
Joe Perches	7f6a11e73d	ext4: remove redundant "EXT4-fs: " from uses of ext4_msg ext4_msg adds "EXT4-fs: " to the messsage output. Remove the redundant bits from uses. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-19 23:09:43 -04:00
Lukas Czerner	dc1841d6cf	ext4: give more helpful error message in ext4_ext_rm_leaf() The error message produced by the ext4_ext_rm_leaf() when we are removing blocks which accidentally ends up inside the existing extent, is not very helpful, because we would like to also know which extent did we collide with. This commit changes the error message to get us also the information about the extent we are colliding with. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-19 23:07:43 -04:00
Lukas Czerner	7877191c28	ext4: remove unused code from ext4_ext_map_blocks() Since the commit 'Rewrite punch hole to use ext4_ext_remove_space()' reworked the punch hole implementation to use ext4_ext_remove_space() instead of ext4_ext_map_blocks(), we can remove the code which is no longer needed from the ext4_ext_map_blocks(). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-19 23:05:43 -04:00
Lukas Czerner	5f95d21fb6	ext4: rewrite punch hole to use ext4_ext_remove_space() This commit rewrites ext4 punch hole implementation to use ext4_ext_remove_space() instead of its home gown way of doing this via ext4_ext_map_blocks(). There are several reasons for changing this. Firstly it is quite non obvious that punching hole needs to ext4_ext_map_blocks() to punch a hole, especially given that this function should map blocks, not unmap it. It also required a lot of new code in ext4_ext_map_blocks(). Secondly the design of it is not very effective. The reason is that we are trying to punch out blocks in ext4_ext_punch_hole() in opposite direction than in ext4_ext_rm_leaf() which causes the ext4_ext_rm_leaf() to iterate through the whole tree from the end to the start to find the requested extent for every extent we are going to punch out. And finally the current implementation does not use the existing code, but bring a lot of new code, which is IMO unnecessary since there already is some infrastructure we can use. Specifically ext4_ext_remove_space(). This commit changes ext4_ext_remove_space() to accept 'end' parameter so we can not only truncate to the end of file, but also remove the space in the middle of the file (punch a hole). Moreover, because the last block to punch out, might be in the middle of the extent, we have to split the extent at 'end + 1' so ext4_ext_rm_leaf() can easily either remove the whole fist part of split extent, or change its size. ext4_ext_remove_space() is then used to actually remove the space (extents) from within the hole, instead of ext4_ext_map_blocks(). Note that this also fix the issue with punch hole, where we would forget to remove empty index blocks from the extent tree, resulting in double free block error and file system corruption. This is simply because we now use different code path, where this problem does not exist. This has been tested with fsx running for several days and xfstests, plus xfstest #251 with '-o discard' run on the loop image (which converts discard requestes into punch hole to the backing file). All of it on 1K and 4K file system block size. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-19 23:03:19 -04:00
Fan Yong	d1f5273e9a	ext4: return 32/64-bit dir name hash according to usage type Traditionally ext2/3/4 has returned a 32-bit hash value from llseek() to appease NFSv2, which can only handle a 32-bit cookie for seekdir() and telldir(). However, this causes problems if there are 32-bit hash collisions, since the NFSv2 server can get stuck resending the same entries from the directory repeatedly. Allow ext4 to return a full 64-bit hash (both major and minor) for telldir to decrease the chance of hash collisions. This still needs integration on the NFS side. Patch-updated-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> (blame me if something is not correct) Signed-off-by: Fan Yong <yong.fan@whamcloud.com> Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-18 22:44:40 -04:00
Theodore Ts'o	31d4f3a2f3	ext4: check for zero length extent Explicitly test for an extent whose length is zero, and flag that as a corrupted extent. This avoids a kernel BUG_ON assertion failure. Tested: Without this patch, the file system image found in tests/f_ext_zero_len/image.gz in the latest e2fsprogs sources causes a kernel panic. With this patch, an ext4 file system error is noted instead, and the file system is marked as being corrupted. https://bugzilla.kernel.org/show_bug.cgi?id=42859 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2012-03-11 23:30:16 -04:00
Curt Wohlgemuth	4188188bdc	ext4: add comments to definition of ext4_io_end_t This should make it more clear what this structure is used for, and how some of the (mutually exclusive) fields are used to keep page cache references. Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-05 10:40:22 -05:00
Curt Wohlgemuth	b43d17f319	ext4: don't release page refs in ext4_end_bio() We can clear PageWriteback on each page when the IO completes, but we can't release the references on the page until we convert any uninitialized extents. Without this patch, the use of the dioread_nolock mount option can break buffered writes, because extents may not be converted by the time a subsequent buffered read comes in; if the page is not in the page cache, a read will return zeros if the extent is still uninitialized. I tested this with a (temporary) patch that adds a call to msleep(1000) at the start of ext4_end_io_work(), to delay processing of each DIO-unwritten work queue item. With this msleep(), a simple workload of fallocate write fadvise read will fail without this patch, succeeds with it. Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-05 10:40:15 -05:00
Jeff Moyer	491caa4363	ext4: fix race between sync and completed io work The following command line will leave the aio-stress process unkillable on an ext4 file system (in my case, mounted on /mnt/test): aio-stress -t 20 -s 10 -O -S -o 2 -I 1000 /mnt/test/aiostress.3561.4 /mnt/test/aiostress.3561.4.20 /mnt/test/aiostress.3561.4.19 /mnt/test/aiostress.3561.4.18 /mnt/test/aiostress.3561.4.17 /mnt/test/aiostress.3561.4.16 /mnt/test/aiostress.3561.4.15 /mnt/test/aiostress.3561.4.14 /mnt/test/aiostress.3561.4.13 /mnt/test/aiostress.3561.4.12 /mnt/test/aiostress.3561.4.11 /mnt/test/aiostress.3561.4.10 /mnt/test/aiostress.3561.4.9 /mnt/test/aiostress.3561.4.8 /mnt/test/aiostress.3561.4.7 /mnt/test/aiostress.3561.4.6 /mnt/test/aiostress.3561.4.5 /mnt/test/aiostress.3561.4.4 /mnt/test/aiostress.3561.4.3 /mnt/test/aiostress.3561.4.2 This is using the aio-stress program from the xfstests test suite. That particular command line tells aio-stress to do random writes to 20 files from 20 threads (one thread per file). The files are NOT preallocated, so you will get writes to random offsets within the file, thus creating holes and extending i_size. It also opens the file with O_DIRECT and O_SYNC. On to the problem. When an I/O requires unwritten extent conversion, it is queued onto the completed_io_list for the ext4 inode. Two code paths will pull work items from this list. The first is the ext4_end_io_work routine, and the second is ext4_flush_completed_IO, which is called via the fsync path (and O_SYNC handling, as well). There are two issues I've found in these code paths. First, if the fsync path beats the work routine to a particular I/O, the work routine will free the io_end structure! It does not take into account the fact that the io_end may still be in use by the fsync path. I've fixed this issue by adding yet another IO_END flag, indicating that the io_end is being processed by the fsync path. The second problem is that the work routine will make an assignment to io->flag outside of the lock. I have witnessed this result in a hang at umount. Moving the flag setting inside the lock resolved that problem. The problem was introduced by commit `b82e384c7b` ("ext4: optimize locking for end_io extent conversion"), which first appeared in 3.2. As such, the fix should be backported to that release (probably along with the unwritten extent conversion race fix). Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> CC: stable@kernel.org	2012-03-05 10:29:52 -05:00
Jeff Moyer	93ef8541d5	ext4: clean up the flags passed to __blockdev_direct_IO For extent-based files, you can perform DIO to holes, as mentioned in the comments in ext4_ext_direct_IO. However, that function passes DIO_SKIP_HOLES to __blockdev_direct_IO, which is really confusing to the uninitiated reader. The key, here, is that the get_block function passed in, ext4_get_block_write, completely ignores the create flag that is passed to it (the create flag is passed in from the direct I/O code, which uses the DIO_SKIP_HOLES flag to determine whether or not it should be cleared). This is a long-winded way of saying that the DIO_SKIP_HOLES flag is ultimately ignored. So let's remove it. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-05 10:19:52 -05:00
Theodore Ts'o	f70486055e	ext4: try to deprecate noacl and noxattr_user mount options No other file system allows ACL's and extended attributes to be enabled or disabled via a mount option. So let's try to deprecate these options from ext4. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-04 22:06:20 -05:00
Theodore Ts'o	c7198b9c1e	ext4: ignore mount options supported by ext2/3 (but have since been removed) Users who tried to use the ext4 file system driver is being used for the ext2 or ext3 file systems (via the CONFIG_EXT4_USE_FOR_EXT23 option) could have failed mounts if their /etc/fstab contains options recognized by ext2 or ext3 but which have since been removed in ext4. So teach ext4 to recognize them and give a warning that the mount option was removed. Report: https://bbs.archlinux.org/profile.php?id=33804 Signed-off-by: Tom Gundersen <teg@jklm.no> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Baechler <thomas@archlinux.org> Cc: Tobias Powalowski <tobias.powalowski@googlemail.com> Cc: Dave Reisner <d@falconindy.com>	2012-03-04 22:00:53 -05:00
Theodore Ts'o	66acdcf4ea	ext4: add debugging /proc file showing file system options Now that /proc/mounts is consistently showing only those mount options which need to be specified in /etc/fstab or on the mount command line, it is useful to have file which shows exactly which file system options are enabled. This can be useful when debugging a user problem. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-04 20:21:38 -05:00
Theodore Ts'o	5a916be1b3	ext4: make ext4_show_options() be table-driven Consistently show mount options which are the non-default, so that /proc/mounts accurately shows the mount options that would be necessary to mount the file system in its current mode of operation. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-04 19:27:31 -05:00
Theodore Ts'o	2adf6da837	ext4: move ext4_show_options() after parse_options() This commit is strictly a code movement so in preparation of changing ext4_show_options to be table driven. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-03 23:20:50 -05:00
Theodore Ts'o	26092bf524	ext4: use a table-driven handler for mount options By using a table-drive approach, we shave about 100 lines of code from ext4, and make the code a bit more regular and factored out. This will also make it possible in a future patch to use this table for displaying the mount options that were specified in /proc/mounts. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-03 23:20:47 -05:00
Theodore Ts'o	72578c33c4	ext4: unify handling of mount options which have been removed Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-03 18:04:40 -05:00
Theodore Ts'o	39ef17f1b0	ext4: simplify handling of the errors=* mount options Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-03 17:56:23 -05:00
Theodore Ts'o	c64db50e76	ext4: remove the I_VERSION mount flag and use the super_block flag instead There's no point to have two bits that are set in parallel; so use the MS_I_VERSION flag that is needed by the VFS anyway, and that way we free up a bit in sbi->s_mount_opts. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-02 12:23:11 -05:00
Theodore Ts'o	ee4a3fcd1d	ext4: remove Opt_ignore This is completely unused so let's just get rid of it. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-02 12:14:24 -05:00
Theodore Ts'o	87f26807e9	ext4: remove deprecation warnings for minix_df and grpid People complained about removing both of these features, so per Linus's dictate, we won't be able to remove them. Sigh... Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-03-02 00:03:21 -05:00
Santosh Nayak	85d216501a	ext4: Fix endianness bug when reading the MMP block Sparse complained about this endian bug in fs/ext4/mmp.c. Signed-off-by: Santosh Nayak <santoshprasadnayak@gmail.com> Reviewed-by: Johann Lombardi <johann@whamcloud.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-02-27 01:09:03 -05:00
Zheng Liu	9ee4930259	ext4: format flag in dx_probe() Fix ext4_warning format flag in dx_probe(). CC: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-02-20 23:09:36 -05:00
Eric Sandeen	c1bb05a657	ext4: avoid deadlock on sync-mounted FS w/o journal Processes hang forever on a sync-mounted ext2 file system that is mounted with the ext4 module (default in Fedora 16). I can reproduce this reliably by mounting an ext2 partition with "-o sync" and opening a new file an that partition with vim. vim will hang in "D" state forever. The same happens on ext4 without a journal. I am attaching a small patch here that solves this issue for me. In the sync mounted case without a journal, ext4_handle_dirty_metadata() may call sync_dirty_buffer(), which can't be called with buffer lock held. Also move mb_cache_entry_release inside lock to avoid race fixed previously by `8a2bfdcb` ext[34]: EA block reference count racing fix Note too that ext2 fixed this same problem in 2006 with `b2f49033` [PATCH] fix deadlock in ext2 Signed-off-by: Martin.Wilck@ts.fujitsu.com [sandeen@redhat.com: move mb_cache_entry_release before unlock, edit commit msg] Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-02-20 23:06:18 -05:00
Lukas Czerner	a0ade1deb8	ext4: fix resize when resizing within single group When resizing file system in the way that the new size of the file system is still in the same group (no new groups are added), then we can hit a BUG_ON in ext4_alloc_group_tables() BUG_ON(flex_gd->count == 0 \|\| group_data == NULL); because flex_gd->count is zero. The reason is the missing check for such case, so the code always extend the last group fully and then attempt to add more groups, but at that time n_blocks_count is actually smaller than o_blocks_count. It can be easily reproduced like this: mkfs.ext4 -b 4096 /dev/sda 30M mount /dev/sda /mnt/test resize2fs /dev/sda 50M Fix this by checking whether the resize happens within the singe group and only add that many blocks into the last group to satisfy user request. Then o_blocks_count == n_blocks_count and the resize will exit successfully without and attempt to add more groups into the fs. Also fix mixing together block number and blocks count which might be confusing and can easily lead to off-by-one errors (but it is actually not the case here since the two occurrence of this mix-up will cancel each other). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reported-by: Milan Broz <mbroz@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-02-20 23:02:06 -05:00
Jeff Moyer	266991b138	ext4: fix race between unwritten extent conversion and truncate The following comment in ext4_end_io_dio caught my attention: /* XXX: probably should move into the real I/O completion handler */ inode_dio_done(inode); The truncate code takes i_mutex, then calls inode_dio_wait. Because the ext4 code path above will end up dropping the mutex before it is reacquired by the worker thread that does the extent conversion, it seems to me that the truncate can happen out of order. Jan Kara mentioned that this might result in error messages in the system logs, but that should be the extent of the "damage." The fix is pretty straight-forward: don't call inode_dio_done until the extent conversion is complete. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org	2012-02-20 17:59:24 -05:00
Heiko Carstens	d4dc462f55	ext4: fix balloc.c printk-format-warning Get rid of this one: fs/ext4/balloc.c: In function 'ext4_wait_block_bitmap': fs/ext4/balloc.c:405:3: warning: format '%llu' expects argument of type 'long long unsigned int', but argument 6 has type 'sector_t' [-Wformat] Happens because sector_t is u64 (unsigned long long) or unsigned long dependent on CONFIG_64BIT. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-02-20 17:57:24 -05:00
Theodore Ts'o	c5e8f3f3bc	ext4: remove EXT4_MB_{BITMAP,BUDDY} macros The EXT4_MB_BITMAP and EXT4_MB_BUDDY macros obfuscate more than they provide any abstraction. So remove them. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-02-20 17:54:06 -05:00
Dan Carpenter	a0cc910f15	ext4: using PTR_ERR() on the wrong variable in ext4_ext_migrate() "inode" is a valid pointer here. "tmp_inode" was intended. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-02-20 17:53:06 -05:00
Dan Carpenter	4fda400360	ext4: remove an unneeded NULL check in __ext4_check_dir_entry() We dereference "bh" unconditionally a couple lines down to find "by->b_size". This function is never called with a NULL "bh" so I have removed the check. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-02-20 17:53:05 -05:00
Zheng Liu	f1b3a2a753	ext4: remove unneeded variable in ext4_xattr_check_block() We could return directly from ext4_xattr_check_block(). Thus, we shouldn't need to define a 'error' variable. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-02-20 17:53:05 -05:00
Eric Sandeen	661aa52057	ext4: remove the resize mount option The resize mount option seems to be of limited value, especially in the age of online resize2fs. Nuke it. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-02-20 17:53:04 -05:00
Eric Sandeen	43e625d84f	ext4: remove the journal=update mount option The V2 journal format was introduced around ten years ago, for ext3. It seems highly unlikely that anyone will need this migration option for ext4. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-02-20 17:53:04 -05:00
Curt Wohlgemuth	1592d2c557	ext4: mark possibly unused variable in ext4_mb_normalize_request() The 'orig_size' local variable is only used in a call to mb_debug(). Mark it with '__maybe_unused'. Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-02-20 17:53:03 -05:00
Bobi Jam	18aadd47f8	ext4: expand commit callback and The per-commit callback was used by mballoc code to manage free space bitmaps after deleted blocks have been released. This patch expands it to support multiple different callbacks, to allow other things to be done after the commit has been completed. Signed-off-by: Bobi Jam <bobijam@whamcloud.com> Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-02-20 17:53:02 -05:00
Theodore Ts'o	856cbcf9a9	ext4: fix INCOMPAT feature codepoint reservation for INLINEDATA In commit `9b90e5e028` I incorrectly reserved the wrong bit for EXT4_FEATURE_INCOMPAT_INLINEDATA per the discussion on the linux-ext4 list on December 7, 2011. The codepoint 0x2000 should be used for EXT4_FEATURE_INCOMPAT_USE_META_CSUM, so INLINEDATA will be assigned the value 0x8000. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-02-20 17:53:01 -05:00
Lukas Czerner	3d2b158262	ext4: ignore EXT4_INODE_JOURNAL_DATA flag with delalloc Ext4 does not support data journalling with delayed allocation enabled. We even do not allow to mount the file system with delayed allocation and data journalling enabled, however it can be set via FS_IOC_SETFLAGS so we can hit the inode with EXT4_INODE_JOURNAL_DATA set even on file system mounted with delayed allocation (default) and that's where problem arises. The easies way to reproduce this problem is with the following set of commands: mkfs.ext4 /dev/sdd mount /dev/sdd /mnt/test1 dd if=/dev/zero of=/mnt/test1/file bs=1M count=4 chattr +j /mnt/test1/file dd if=/dev/zero of=/mnt/test1/file bs=1M count=4 conv=notrunc chattr -j /mnt/test1/file Additionally it can be reproduced quite reliably with xfstests 272 and 269. In fact the above reproducer is a part of test 272. To fix this we should ignore the EXT4_INODE_JOURNAL_DATA inode flag if the file system is mounted with delayed allocation. This can be easily done by fixing ext4_should__data() functions do ignore data journal flag when delalloc is set (suggested by Ted). We also have to set the appropriate address space operations for the inode (again, ignoring data journal flag if delalloc enabled). Additionally this commit introduces ext4_inode_journal_mode() function because ext4_should__data() has already had a lot of common code and this change is putting it all into one function so it is easier to read. Successfully tested with xfstests in following configurations: delalloc + data=ordered delalloc + data=writeback data=journal nodelalloc + data=ordered nodelalloc + data=writeback nodelalloc + data=journal Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org	2012-02-20 17:53:00 -05:00
Theodore Ts'o	813e57276f	ext4: fix race when setting bitmap_uptodate flag In ext4_read_{inode,block}_bitmap() we were setting bitmap_uptodate() before submitting the buffer for read. The is bad, since we check bitmap_uptodate() without locking the buffer, and so if another process is racing with us, it's possible that they will think the bitmap is uptodate even though the read has not completed yet, resulting in inodes and blocks potentially getting allocated more than once if we get really unlucky. Addresses-Google-Bug: 2828254 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-02-20 17:52:46 -05:00
Theodore Ts'o	119c0d4460	ext4: fold ext4_claim_inode into ext4_new_inode The function ext4_claim_inode() is only called by one function, ext4_new_inode(), and by folding the functionality into ext4_new_inode(), we can remove almost 50 lines of code, and put all of the logic of allocating a new inode into a single place. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-02-06 20:12:03 -05:00
Theodore Ts'o	ff9cb1c4ee	Merge branch 'for_linus' into for_linus_merged Conflicts: fs/ext4/ioctl.c	2012-01-10 11:54:07 -05:00
Xi Wang	d50f2ab6f0	ext4: fix undefined behavior in ext4_fill_flex_info() Commit `503358ae01` ("ext4: avoid divide by zero when trying to mount a corrupted file system") fixes CVE-2009-4307 by performing a sanity check on s_log_groups_per_flex, since it can be set to a bogus value by an attacker. sbi->s_log_groups_per_flex = sbi->s_es->s_log_groups_per_flex; groups_per_flex = 1 << sbi->s_log_groups_per_flex; if (groups_per_flex < 2) { ... } This patch fixes two potential issues in the previous commit. 1) The sanity check might only work on architectures like PowerPC. On x86, 5 bits are used for the shifting amount. That means, given a large s_log_groups_per_flex value like 36, groups_per_flex = 1 << 36 is essentially 1 << 4 = 16, rather than 0. This will bypass the check, leaving s_log_groups_per_flex and groups_per_flex inconsistent. 2) The sanity check relies on undefined behavior, i.e., oversized shift. A standard-confirming C compiler could rewrite the check in unexpected ways. Consider the following equivalent form, assuming groups_per_flex is unsigned for simplicity. groups_per_flex = 1 << sbi->s_log_groups_per_flex; if (groups_per_flex == 0 \|\| groups_per_flex == 1) { We compile the code snippet using Clang 3.0 and GCC 4.6. Clang will completely optimize away the check groups_per_flex == 0, leaving the patched code as vulnerable as the original. GCC keeps the check, but there is no guarantee that future versions will do the same. Signed-off-by: Xi Wang <xi.wang@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org	2012-01-10 11:51:10 -05:00
Linus Torvalds	e4e11180df	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: new helper - d_make_root() dcache: use a dispose list in select_parent ceph: d_alloc_root() may fail ext4: fix failure exits isofs: inode leak on mount failure	2012-01-09 17:37:37 -08:00

1 2 3 4 5 ...

1631 Commits