OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Josef Bacik	c3308f84c1	Btrfs: fix punch hole when no extent exists I saw the warning in btrfs_drop_extent_cache where our end is less than our start while running xfstests 68 in a loop. This is because we unconditionally do drop_end = min(end, extent_end) in __btrfs_drop_extents(), even though we may not have found an extent in the range we were looking to drop. So keep track of wether or not we found something, and if we didn't just use our end. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-04 09:40:00 -04:00
Josef Bacik	926ced123b	Btrfs: don't do anything in our ->freeze_fs and ->unfreeze_fs We do not need to do anything special to freeze or unfreeze, it's all taken care of by the generic work, and what we currently have is wrong anyway since we shouldn't be returnning to userspace with mutexes held anyway. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-04 09:40:00 -04:00
Josef Bacik	892951a92e	Btrfs: remove unused write cache pages hook The btree inode has it's own write cache pages so we can remove this write cache pages hook as it's not used. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-04 09:39:59 -04:00
Josef Bacik	b5bae2612a	Btrfs: fix race when getting the eb out of page->private We can race when checking wether PagePrivate is set on a page and we actually have an eb saved in the pages private pointer. We could have easily written out this page and released it in the time that we did the pagevec lookup and actually got around to looking at this page. So use mapping->private_lock to ensure we get a consistent view of the page->private pointer. This is inline with the alloc and releasepage paths which use private_lock when manipulating page->private. Thanks, Reported-by: David Sterba <dave@jikos.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-04 09:39:59 -04:00
Josef Bacik	ff44c6e36d	Btrfs: do not hold the write_lock on the extent tree while logging Dave Sterba pointed out a sleeping while atomic bug while doing fsync. This is because I'm an idiot and didn't realize that rwlock's were spin locks, so we've been holding this thing while doing allocations and such which is not good. This patch fixes this by dropping the write lock before we do anything heavy and re-acquire it when it is done. We also need to take a ref on the em's in case their corresponding pages are evicted and mark them as being logged so that releasepage does not remove them and doesn't remove them from our local list. Thanks, Reported-by: Dave Sterba <dave@jikos.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-04 09:39:58 -04:00
Josef Bacik	98114659e0	Btrfs: fix race with freeze and free space inodes So we start our freeze, somebody comes in and does an fsync() on a file where we have to commit a transaction for whatever reason, and we will deadlock because the freeze is waiting on FS_FREEZE people to stop writing to the file system, but the transaction is waiting for its free space inodes to be written out, which are in turn waiting on sb_start_intwrite while trying to write the file extents. To fix this we'll just skip the sb_start_intwrite() if we TRANS_JOIN_NOLOCK since we're being waited on by a transaction commit so we're safe wrt to freeze and this will keep us from deadlocking. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-04 09:39:58 -04:00
Liu Bo	6bbe3a9c80	Btrfs: kill obsolete arguments in btrfs_wait_ordered_extents nocow_only is now an obsolete argument. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-04 09:39:57 -04:00
Liu Bo	2e90cf858f	Btrfs: cleanup fs_info->hashers fs_info->hashers is now an obsolete one. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-04 09:39:57 -04:00
Liu Bo	ab26e9d6c8	Btrfs: cleanup for duplicated code in find_free_extent There is already an 'add free space' phrase in front of this one, we needn't to redo it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-04 09:39:57 -04:00
Josef Bacik	60376ce4a8	Btrfs: fix race in sync and freeze again I screwed this up, there is a race between checking if there is a running transaction and actually starting a transaction in sync where we could race with a freezer and get ourselves into trouble. To fix this we need to make a new join type to only do the try lock on the freeze stuff. If it fails we'll return EPERM and just return from sync. This fixes a hang Liu Bo reported when running xfstest 68 in a loop. Thanks, Reported-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-04 09:39:56 -04:00
David Sterba	b3ae244e71	btrfs: return EPERM upon rmdir on a subvolume A subvolume cannot be deleted via rmdir, but the error code ENOTEMPTY is confusing. Return EPERM instead, as this is not permitted. Signed-off-by: David Sterba <dsterba@suse.cz>	2012-10-04 09:39:56 -04:00
Wei Yongjun	ebb3dad435	Btrfs: using for_each_set_bit_from to simplify the code Using for_each_set_bit_from() to simplify the code. spatch with a semantic match is used to found this. (http://coccinelle.lip6.fr/) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>	2012-10-04 09:39:55 -04:00
Anand Jain	1bcea35597	Btrfs: write_buf is now callable outside send.c Developing service cmds needs it. Signed-off-by: Anand Jain <anand.jain@oracle.com>	2012-10-04 09:39:55 -04:00
Tsutomu Itoh	b4f359ab06	Btrfs: remove unnecessary code in btree_get_extent() Unnecessary lookup_extent_mapping() is removed because an error is returned to the caller. This patch was made based on the advice from Stefan Behrens, thanks. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>	2012-10-04 09:39:54 -04:00
Tsutomu Itoh	0433f20d43	Btrfs: cleanup of error processing in btree_get_extent() This patch simplifies a little complex error processing in btree_get_extent(). Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>	2012-10-04 09:39:54 -04:00
Andy Adamson	5f65753033	NFSv4 set open access operation call flag in nfs4_init_opendata_res nfs4_open_recover_helper zeros the nfs4_opendata result structures, removing the result access_request information which leads to an XDR decode error. Move the setting of the result access_request field to nfs4_init_opendata_res which sets all the other required nfs4_opendata result fields and is shared between the open and recover open paths. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-03 17:10:28 -07:00
Dan Carpenter	74b217d0d3	ore: signedness bug in _sp2d_min_pg() This for loop doesn't work correctly when "p" is unsigned. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>	2012-10-03 13:51:51 -07:00
Trond Myklebust	8544a9dc18	NFSv4.1: Remove the dependency on CONFIG_EXPERIMENTAL CONFIG_EXPERIMENTAL is deprecated and, regardless of that, this code is being enabled in most newer distributions. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-03 10:54:50 -07:00
Alex Elder	6285bc2312	ceph: avoid 32-bit page index overflow A pgoff_t is defined (by default) to have type (unsigned long). On architectures such as i686 that's a 32-bit type. The ceph address space code was attempting to produce 64 bit offsets by shifting a page's index by PAGE_CACHE_SHIFT, but the result was not what was desired because the shift occurred before the result got promoted to 64 bits. Fix this by converting all uses of page->index used in this way to use the page_offset() macro, which ensures the 64-bit result has the intended value. This fixes http://tracker.newdream.net/issues/3112 Reported-by: Mohamed Pakkeer <pakkeer.mohideen@realimage.com> Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-10-03 10:51:18 -05:00
Sage Weil	457712a0bc	ceph: return EIO on invalid layout on GET_DATALOC ioctl If the user calls GET_DATALOC on a file with an invalid (e.g., zeroed) layout, return EIO to userland. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-10-03 10:51:17 -05:00
Linus Torvalds	eb0ad9c06d	JFS TRIM support and some minor fixes -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iQIcBAABAgAGBQJQbEYbAAoJEDaohF61QIxkd4QP/Akm+fi7I8vHFc7jdckxmt+X BS0i1SpMvxH29JhSx2ZaamNdgSW8hc3pVP2XkoICUMIDOBcx564y1mRRnOXZ1mdw NAwPmx4gP5OMtHOOJS0X2z7cIRnyFvhtpPtrbU2P5aOmdpEPpLMKilFoOOrWfsn0 bhNtFQOIZnYetG9isdNxi/P3NWctCDkL2x+PvKjUp081zDkJvSWjn6RB5QGaGKVF BOXXSK/1LYvRSJehYXekLObhlCCHUV8jMDsZ8dzJxiHaOdSOKPbjCGG1D8CJIYV2 7lOaZvjs7muQR4QME2CBQ6VDPW2KUPapAKRuBCglMXwP+Ym+aU/MpXYCE3nydgeQ kVR5jvwQcB2hRj8ALyCL79i6jM1DoN3rbaU2FHdU9enHmmEuG4424D5pwHHyUlRJ zb52HPGpilAHIyXHAhAl2EOvlFA60Sx1dUKiAkgc+mFKcsZaGUJw3XWSp99trSa2 f7ZHzrmQaq0tcoYDiH00wZZfCHPlwuXxmFkRrC721xjYNOaMZKAn8n7Xi42Ap30Z 7IzWhwNOOZuxx9CmEZDXY+6UAx28aRXsuiKwOeUVLyXcAdOx/9DxjoUjUbgNZqf0 wu6z3kXqkD3ZnkOiKcV1lE1oBtdgZtY2S92s6SBdduxojZqC9U8RYU9ogssePo5R pK69xihOXuAetSGmopPO =9O7B -----END PGP SIGNATURE----- Merge tag 'jfs-3.7' of git://github.com/kleikamp/linux-shaggy Pull JFS update from Dave Kleikamp: "JFS TRIM support and some minor fixes" * tag 'jfs-3.7' of git://github.com/kleikamp/linux-shaggy: jfs: Fix do_div precision in commit `b40c2e66` JFS: use list_move instead of list_del/list_add jfs: Remove obsolete email address fs/jfs: TRIM support for JFS Filesystem	2012-10-03 08:48:21 -07:00
Linus Torvalds	88265322c1	Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem updates from James Morris: "Highlights: - Integrity: add local fs integrity verification to detect offline attacks - Integrity: add digital signature verification - Simple stacking of Yama with other LSMs (per LSS discussions) - IBM vTPM support on ppc64 - Add new driver for Infineon I2C TIS TPM - Smack: add rule revocation for subject labels" Fixed conflicts with the user namespace support in kernel/auditsc.c and security/integrity/ima/ima_policy.c. * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (39 commits) Documentation: Update git repository URL for Smack userland tools ima: change flags container data type Smack: setprocattr memory leak fix Smack: implement revoking all rules for a subject label Smack: remove task_wait() hook. ima: audit log hashes ima: generic IMA action flag handling ima: rename ima_must_appraise_or_measure audit: export audit_log_task_info tpm: fix tpm_acpi sparse warning on different address spaces samples/seccomp: fix 31 bit build on s390 ima: digital signature verification support ima: add support for different security.ima data types ima: add ima_inode_setxattr/removexattr function and calls ima: add inode_post_setattr call ima: replace iint spinblock with rwlock/read_lock ima: allocating iint improvements ima: add appraise action keywords and default rules ima: integrity appraisal extension vfs: move ima_file_free before releasing the file ...	2012-10-02 21:38:48 -07:00
Linus Torvalds	782c3fb22b	No big changes for 3.7 in UBIFS: * Error reporting and debug printing improvements * Power cut emulation fixes * Minor cleanups -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABAgAGBQJQaumRAAoJECmIfjd9wqK016UQAImRrzhPoegy2Pr7j/CpzUST wJru3QlgdHgjukWer/s4CGTTjmTLMgLAllUnJLO3yn+Po4o67RJxI4SjhMf5lwDG TSmB/ttQpRmYxaaS2EOC6Ow+jdMM2MhAsPrD8nAfLvOm/XduuSyquPY/xFa6U4rJ oE4w23b5t2BZI/UJmsCaYs6Rj1h8w1aZUukii+J9BN8DGygn/vJuI9EDNKdtQmxe vmoT2uWKPyL859kY/+lONH048NWMkEB3BhNmuAsFqGY/tdQRdmeyhgWEKNjbmc/M DXZe7ovy5umUyw6iDpfhEhmOBdV9xorDySsmKNP1/q60rUGD6R/tynC4y1q1IueB c6fbry6xpwKIccMS/w7x9cbZxMnG++iQoj71pr2FV9Bg5Xd7XduM/XjzI5rMHuqv EnmcwT/LIu5Lw1vMU8pfW2yIdTkotNR/2aca1bs0f4WtS3AhngdqmNbFmddcpY37 qvTYDrSMOW/IaegiDQzucVXNhIs8ufIULZzmj9CtmgeyvNVJboTJRik+HJDWFqeC 04TaM8k+Yjp/isbzTmWEyBG3G06cNpMwLtT5OKcpRVQ1ZYu62AbnR+Dbep6n0qt+ gBORW8TTY32GudeSiHLKCrOsrkx3zNSli6T4Sw9YD5e29Dee62KRTgjKVplezQfB Djh70vUyX2tHdYlW88gf =gM4M -----END PGP SIGNATURE----- Merge tag 'upstream-3.7-rc1' of git://git.infradead.org/linux-ubifs Pull ubifs changes from Artem Bityutskiy: "No big changes for 3.7 in UBIFS: - Error reporting and debug printing improvements - Power cut emulation fixes - Minor cleanups" Fix trivial conflict in fs/ubifs/debug.c due to the user namespace changes. * tag 'upstream-3.7-rc1' of git://git.infradead.org/linux-ubifs: UBIFS: print less UBIFS: use pr_ helper instead of printk UBIFS: comply with coding style UBIFS: use __aligned() attribute UBIFS: remove __DATE__ and __TIME__ UBIFS: fix power cut emulation for mtdram UBIFS: improve scanning debug output UBIFS: always print full error reports UBIFS: print PID in debug messages	2012-10-02 20:47:48 -07:00
Linus Torvalds	60c7b4df82	xfs: update for 3.7-rc1 Several enhancements and cleanups: - make inode32 and inode64 remountable options - SEEK_HOLE/SEEK_DATA enhancements - cleanup struct declarations in xfs_mount.h -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) iQIcBAABAgAGBQJQayQwAAoJENaLyazVq6ZO2H0P/RiWpYlVLP0ZNbxUOvL33WFy xOHwNkhRG23Z0sgDfrUJjDibzOuSmf/XothsChGPE4Mm8JLCjZKzxR77ySP+pbC5 sylrciXl5E2tY3wtW8oh2kQORLRneqA6qXNCrOtYklqwSmB0qx7bf9OdjL/ALiuC SvTAn/PhyVZqGJgVZyEAc63CsbSzK2XmLgyfeQeA7JbSptbg5fDkFYrbNE+zr+yW S1umMsuZW+zwDjzRq0lGkLcYOH3fD0JpMrfqvtQsk3+SIDlq1HSaObD63dhOJVUV LUpOqgXCPEC3nOL3Dh+i8gv5OTRLJW0P2zYuAY++kqYnuRsYU18tcfdd0loTGeHV LulurSIMBJV19Qx1K3C3E8KPwbwNwNlvmpEbgXy00Qet7LTPbofcUXDi3mcV6ozQ SMGQh42VGV4S8wEjAp93wLva7xLf6enV/X6Stuzy/ec8HN8K2tYMyYyGvSxZbjmq 9JTxCulDvbr7EnSDqCkH1inlQ4cXvBSzOi2trDpQbKq0cuYbqlzgL2sloBv+y+CB 4fTWEeeisQ05VhHU8Yp3bsVtiyUjGGcbOgvqM+NHmmh1HuHUmiBnXh7k3RfaLfBi E4RprFT0Yrpo16mt3Fn0GjkhY87DYHBGIehFOlu13zKPc5q+IBTwIjSugxoa23EL uXTtFFHMxR4pUV/An5Wf =G6gd -----END PGP SIGNATURE----- Merge tag 'for-linus-v3.7-rc1' of git://oss.sgi.com/xfs/xfs Pull xfs update from Ben Myers: "Several enhancements and cleanups: - make inode32 and inode64 remountable options - SEEK_HOLE/SEEK_DATA enhancements - cleanup struct declarations in xfs_mount.h" * tag 'for-linus-v3.7-rc1' of git://oss.sgi.com/xfs/xfs: xfs: Make inode32 a remountable option xfs: add inode64->inode32 transition into xfs_set_inode32() xfs: Fix mp->m_maxagi update during inode64 remount xfs: reduce code duplication handling inode32/64 options xfs: make inode64 as the default allocation mode xfs: Fix m_agirotor reset during AG selection Make inode64 a remountable option xfs: stop the sync worker before xfs_unmountfs xfs: xfs_seek_hole() refinement with hole searching from page cache for unwritten extents xfs: xfs_seek_data() refinement with unwritten extents check up from page cache xfs: Introduce a helper routine to probe data or hole offset from page cache xfs: Remove type argument from xfs_seek_data()/xfs_seek_hole() xfs: fix race while discarding buffers [V4] xfs: check for possible overflow in xfs_ioc_trim xfs: unlock the AGI buffer when looping in xfs_dialloc xfs: kill struct declarations in xfs_mount.h xfs: fix uninitialised variable in xfs_rtbuf_get()	2012-10-02 20:42:58 -07:00
Linus Torvalds	aab174f0df	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs update from Al Viro: - big one - consolidation of descriptor-related logics; almost all of that is moved to fs/file.c (BTW, I'm seriously tempted to rename the result to fd.c. As it is, we have a situation when file_table.c is about handling of struct file and file.c is about handling of descriptor tables; the reasons are historical - file_table.c used to be about a static array of struct file we used to have way back). A lot of stray ends got cleaned up and converted to saner primitives, disgusting mess in android/binder.c is still disgusting, but at least doesn't poke so much in descriptor table guts anymore. A bunch of relatively minor races got fixed in process, plus an ext4 struct file leak. - related thing - fget_light() partially unuglified; see fdget() in there (and yes, it generates the code as good as we used to have). - also related - bits of Cyrill's procfs stuff that got entangled into that work; _not_ all of it, just the initial move to fs/proc/fd.c and switch of fdinfo to seq_file. - Alex's fs/coredump.c spiltoff - the same story, had been easier to take that commit than mess with conflicts. The rest is a separate pile, this was just a mechanical code movement. - a few misc patches all over the place. Not all for this cycle, there'll be more (and quite a few currently sit in akpm's tree)." Fix up trivial conflicts in the android binder driver, and some fairly simple conflicts due to two different changes to the sock_alloc_file() interface ("take descriptor handling from sock_alloc_file() to callers" vs "net: Providing protocol type via system.sockprotoname xattr of /proc/PID/fd entries" adding a dentry name to the socket) * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits) MAX_LFS_FILESIZE should be a loff_t compat: fs: Generic compat_sys_sendfile implementation fs: push rcu_barrier() from deactivate_locked_super() to filesystems btrfs: reada_extent doesn't need kref for refcount coredump: move core dump functionality into its own file coredump: prevent double-free on an error path in core dumper usb/gadget: fix misannotations fcntl: fix misannotations ceph: don't abuse d_delete() on failure exits hypfs: ->d_parent is never NULL or negative vfs: delete surplus inode NULL check switch simple cases of fget_light to fdget new helpers: fdget()/fdput() switch o2hb_region_dev_write() to fget_light() proc_map_files_readdir(): don't bother with grabbing files make get_file() return its argument vhost_set_vring(): turn pollstart/pollstop into bool switch prctl_set_mm_exe_file() to fget_light() switch xfs_find_handle() to fget_light() switch xfs_swapext() to fget_light() ...	2012-10-02 20:25:04 -07:00
Catalin Marinas	8f9c0119d7	compat: fs: Generic compat_sys_sendfile implementation This function is used by sparc, powerpc and arm64 for compat support. The patch adds a generic implementation which calls do_sendfile() directly and avoids set_fs(). The sparc architecture has wrappers for the sign extensions while powerpc relies on the compiler to do the this. The patch adds wrappers for powerpc to handle the u32->int type conversion. compat_sys_sendfile64() can be replaced by a sys_sendfile() call since compat_loff_t has the same size as off_t on a 64-bit system. On powerpc, the patch also changes the 64-bit sendfile call from sys_sendile64 to sys_sendfile. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-10-02 21:35:55 -04:00
Kirill A. Shutemov	8c0a853770	fs: push rcu_barrier() from deactivate_locked_super() to filesystems There's no reason to call rcu_barrier() on every deactivate_locked_super(). We only need to make sure that all delayed rcu free inodes are flushed before we destroy related cache. Removing rcu_barrier() from deactivate_locked_super() affects some fast paths. E.g. on my machine exit_group() of a last process in IPC namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-10-02 21:35:55 -04:00
Al Viro	99621b44aa	btrfs: reada_extent doesn't need kref for refcount All increments and decrements are under the same spinlock - have to be, since they need to protect the radix_tree it's found in. Just use int, no need to wank with kref... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-10-02 21:35:55 -04:00
Alex Kelly	10c28d937e	coredump: move core dump functionality into its own file This prepares for making core dump functionality optional. The variable "suid_dumpable" and associated functions are left in fs/exec.c because they're used elsewhere, such as in ptrace. Signed-off-by: Alex Kelly <alex.page.kelly@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-10-02 21:35:55 -04:00
Andy Adamson	e23008ec81	NFSv4 reduce attribute requests for open reclaim We currently make no distinction in attribute requests between normal OPENs and OPEN with CLAIM_PREVIOUS. This offers more possibility of failures in the GETATTR response which foils OPEN reclaim attempts. Reduce the requested attributes to the bare minimum needed to update the reclaim open stateid and split nfs4_opendata_to_nfs4_state processing accordingly. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-02 18:12:25 -07:00
Linus Torvalds	fc47912d9c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input updates from Dmitry Torokhov: "A few drivers were updated with device tree bindings and others got a few small cleanups and fixes." Fix trivial conflict in drivers/input/keyboard/omap-keypad.c due to changes clashing with a whitespace cleanup. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (28 commits) Input: wacom - mark Intuos5 pad as in-prox when touching buttons Input: synaptics - adjust threshold for treating position values as negative Input: hgpk - use %*ph to dump small buffer Input: gpio_keys_polled - fix dt pdata->nbuttons Input: Add KD[GS]KBDIACRUC ioctls to the compatible list Input: omap-keypad - fixed formatting Input: tegra - move platform data header Input: wacom - add support for EMR on Cintiq 24HD touch Input: s3c2410_ts - make s3c_ts_pmops const Input: samsung-keypad - use of_get_child_count() helper Input: samsung-keypad - use of_match_ptr() Input: uinput - fix formatting Input: uinput - specify exact bit sizes on userspace APIs Input: uinput - mark failed submission requests as free Input: uinput - fix race that can block nonblocking read Input: uinput - return -EINVAL when read buffer size is too small Input: uinput - take event lock when fetching events from buffer Input: get rid of MATCH_BIT() macro Input: rotary-encoder - add DT bindings Input: rotary-encoder - constify platform data pointers ...	2012-10-02 17:16:10 -07:00
Trond Myklebust	807d66d802	NFSv4: nfs4_open_done first must check that GETATTR decoded a file type ...before it can check the validity of that file type. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-02 17:09:00 -07:00
Trond Myklebust	25a1a6211d	NFSv4.1: Deal with wraparound when updating the layout "barrier" seqid ...and fix a bug in pnfs_set_layout_stateid. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-02 17:04:33 -07:00
Trond Myklebust	5a65503f3d	NFSv4.1: Deal with wraparound issues when updating the layout stateid ...and add a helper function. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-02 16:47:14 -07:00
Trond Myklebust	038d649376	NFSv4.1: Always set the layout stateid if this is the first layoutget If the list of layout segments is empty, we must unconditionally set the layout stateid. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-02 16:38:41 -07:00
Trond Myklebust	251ec410c4	NFSv4.1: Fix another refcount issue in pnfs_find_alloc_layout Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-02 15:41:05 -07:00
Weston Andros Adamson	ae2bb03236	NFSv4: don't put ACCESS in OPEN compound if O_EXCL Don't put an ACCESS op in OPEN compound if O_EXCL, because ACCESS will return permission denied for all bits until close. Fixes a regression due to commit `6168f62c` (NFSv4: Add ACCESS operation to OPEN compound) Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-02 14:56:19 -07:00
Weston Andros Adamson	bbd3a8eee8	NFSv4: don't check MAY_WRITE access bit in OPEN Don't check MAY_WRITE as a newly created file may not have write mode bits, but POSIX allows the creating process to write regardless. This is ok because NFSv4 OPEN ops handle write permissions correctly - the ACCESS in the OPEN compound is to differentiate READ v EXEC permissions. Fixes a regression due to commit `6168f62c` (NFSv4: Add ACCESS operation to OPEN compound) Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-02 14:55:41 -07:00
Linus Torvalds	aecdc33e11	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Pull networking changes from David Miller: 1) GRE now works over ipv6, from Dmitry Kozlov. 2) Make SCTP more network namespace aware, from Eric Biederman. 3) TEAM driver now works with non-ethernet devices, from Jiri Pirko. 4) Make openvswitch network namespace aware, from Pravin B Shelar. 5) IPV6 NAT implementation, from Patrick McHardy. 6) Server side support for TCP Fast Open, from Jerry Chu and others. 7) Packet BPF filter supports MOD and XOR, from Eric Dumazet and Daniel Borkmann. 8) Increate the loopback default MTU to 64K, from Eric Dumazet. 9) Use a per-task rather than per-socket page fragment allocator for outgoing networking traffic. This benefits processes that have very many mostly idle sockets, which is quite common. From Eric Dumazet. 10) Use up to 32K for page fragment allocations, with fallbacks to smaller sizes when higher order page allocations fail. Benefits are a) less segments for driver to process b) less calls to page allocator c) less waste of space. From Eric Dumazet. 11) Allow GRO to be used on GRE tunnels, from Eric Dumazet. 12) VXLAN device driver, one way to handle VLAN issues such as the limitation of 4096 VLAN IDs yet still have some level of isolation. From Stephen Hemminger. 13) As usual there is a large boatload of driver changes, with the scale perhaps tilted towards the wireless side this time around. Fix up various fairly trivial conflicts, mostly caused by the user namespace changes. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1012 commits) hyperv: Add buffer for extended info after the RNDIS response message. hyperv: Report actual status in receive completion packet hyperv: Remove extra allocated space for recv_pkt_list elements hyperv: Fix page buffer handling in rndis_filter_send_request() hyperv: Fix the missing return value in rndis_filter_set_packet_filter() hyperv: Fix the max_xfer_size in RNDIS initialization vxlan: put UDP socket in correct namespace vxlan: Depend on CONFIG_INET sfc: Fix the reported priorities of different filter types sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP sfc: Fix loopback self-test with separate_tx_channels=1 sfc: Fix MCDI structure field lookup sfc: Add parentheses around use of bitfield macro arguments sfc: Fix null function pointer in efx_sriov_channel_type vxlan: virtual extensible lan igmp: export symbol ip_mc_leave_group netlink: add attributes to fdb interface tg3: unconditionally select HWMON support when tg3 is enabled. Revert "net: ti cpsw ethernet: allow reading phy interface mode from DT" gre: fix sparse warning ...	2012-10-02 13:38:27 -07:00
Bryan Schumaker	ddfc4e1712	NFS: Set key construction data for the legacy upcall This prevents a null pointer dereference when nfs_idmap_complete_pipe_upcall_locked() calls complete_request_key(). Fixes a regression caused by commit `0cac12023` (NFSv4: Ensure that idmap_pipe_downcall sanity-checks the downcall data). Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-02 13:04:09 -07:00
Weston Andros Adamson	fd4835708f	NFSv4.1: don't do two EXCHANGE_IDs on mount Since the addition of NFSv4 server trunking detection the mount context calls nfs4_proc_exchange_id then schedules the state manager, which also calls nfs4_proc_exchange_id. Setting the NFS4CLNT_LEASE_CONFIRM bit makes the state manager skip the unneeded EXCHANGE_ID and continue on with session creation. Reported-by: Jorge Mora <mora@netapp.com> Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-02 11:35:47 -07:00
Linus Torvalds	437589a74b	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace changes from Eric Biederman: "This is a mostly modest set of changes to enable basic user namespace support. This allows the code to code to compile with user namespaces enabled and removes the assumption there is only the initial user namespace. Everything is converted except for the most complex of the filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs, nfs, ocfs2 and xfs as those patches need a bit more review. The strategy is to push kuid_t and kgid_t values are far down into subsystems and filesystems as reasonable. Leaving the make_kuid and from_kuid operations to happen at the edge of userspace, as the values come off the disk, and as the values come in from the network. Letting compile type incompatible compile errors (present when user namespaces are enabled) guide me to find the issues. The most tricky areas have been the places where we had an implicit union of uid and gid values and were storing them in an unsigned int. Those places were converted into explicit unions. I made certain to handle those places with simple trivial patches. Out of that work I discovered we have generic interfaces for storing quota by projid. I had never heard of the project identifiers before. Adding full user namespace support for project identifiers accounts for most of the code size growth in my git tree. Ultimately there will be work to relax privlige checks from "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing root in a user names to do those things that today we only forbid to non-root users because it will confuse suid root applications. While I was pushing kuid_t and kgid_t changes deep into the audit code I made a few other cleanups. I capitalized on the fact we process netlink messages in the context of the message sender. I removed usage of NETLINK_CRED, and started directly using current->tty. Some of these patches have also made it into maintainer trees, with no problems from identical code from different trees showing up in linux-next. After reading through all of this code I feel like I might be able to win a game of kernel trivial pursuit." Fix up some fairly trivial conflicts in netfilter uid/git logging code. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits) userns: Convert the ufs filesystem to use kuid/kgid where appropriate userns: Convert the udf filesystem to use kuid/kgid where appropriate userns: Convert ubifs to use kuid/kgid userns: Convert squashfs to use kuid/kgid where appropriate userns: Convert reiserfs to use kuid and kgid where appropriate userns: Convert jfs to use kuid/kgid where appropriate userns: Convert jffs2 to use kuid and kgid where appropriate userns: Convert hpfs to use kuid and kgid where appropriate userns: Convert btrfs to use kuid/kgid where appropriate userns: Convert bfs to use kuid/kgid where appropriate userns: Convert affs to use kuid/kgid wherwe appropriate userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids userns: On ia64 deal with current_uid and current_gid being kuid and kgid userns: On ppc convert current_uid from a kuid before printing. userns: Convert s390 getting uid and gid system calls to use kuid and kgid userns: Convert s390 hypfs to use kuid and kgid where appropriate userns: Convert binder ipc to use kuids userns: Teach security_path_chown to take kuids and kgids userns: Add user namespace support to IMA userns: Convert EVM to deal with kuids and kgids in it's hmac computation ...	2012-10-02 11:11:09 -07:00
Linus Torvalds	c0e8a139a5	Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - xattr support added. The implementation is shared with tmpfs. The usage is restricted and intended to be used to manage per-cgroup metadata by system software. tmpfs changes are routed through this branch with Hugh's permission. - cgroup subsystem ID handling simplified. * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Define CGROUP_SUBSYS_COUNT according the configuration cgroup: Assign subsystem IDs during compile time cgroup: Do not depend on a given order when populating the subsys array cgroup: Wrap subsystem selection macro cgroup: Remove CGROUP_BUILTIN_SUBSYS_COUNT cgroup: net_prio: Do not define task_netpioidx() when not selected cgroup: net_cls: Do not define task_cls_classid() when not selected cgroup: net_cls: Move sock_update_classid() declaration to cls_cgroup.h cgroup: trivial fixes for Documentation/cgroups/cgroups.txt xattr: mark variable as uninitialized to make both gcc and smatch happy fs: add missing documentation to simple_xattr functions cgroup: add documentation on extended attributes usage cgroup: rename subsys_bits to subsys_mask cgroup: add xattr support cgroup: revise how we re-populate root directory xattr: extract simple_xattr code from tmpfs	2012-10-02 10:50:47 -07:00
Linus Torvalds	033d9959ed	Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue changes from Tejun Heo: "This is workqueue updates for v3.7-rc1. A lot of activities this round including considerable API and behavior cleanups. * delayed_work combines a timer and a work item. The handling of the timer part has always been a bit clunky leading to confusing cancelation API with weird corner-case behaviors. delayed_work is updated to use new IRQ safe timer and cancelation now works as expected. * Another deficiency of delayed_work was lack of the counterpart of mod_timer() which led to cancel+queue combinations or open-coded timer+work usages. mod_delayed_work[_on]() are added. These two delayed_work changes make delayed_work provide interface and behave like timer which is executed with process context. * A work item could be executed concurrently on multiple CPUs, which is rather unintuitive and made flush_work() behavior confusing and half-broken under certain circumstances. This problem doesn't exist for non-reentrant workqueues. While non-reentrancy check isn't free, the overhead is incurred only when a work item bounces across different CPUs and even in simulated pathological scenario the overhead isn't too high. All workqueues are made non-reentrant. This removes the distinction between flush_[delayed_]work() and flush_[delayed_]_work_sync(). The former is now as strong as the latter and the specified work item is guaranteed to have finished execution of any previous queueing on return. * In addition to the various bug fixes, Lai redid and simplified CPU hotplug handling significantly. * Joonsoo introduced system_highpri_wq and used it during CPU hotplug. There are two merge commits - one to pull in IRQ safe timer from tip/timers/core and the other to pull in CPU hotplug fixes from wq/for-3.6-fixes as Lai's hotplug restructuring depended on them." Fixed a number of trivial conflicts, but the more interesting conflicts were silent ones where the deprecated interfaces had been used by new code in the merge window, and thus didn't cause any real data conflicts. Tejun pointed out a few of them, I fixed a couple more. * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits) workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending() workqueue: use cwq_set_max_active() helper for workqueue_set_max_active() workqueue: introduce cwq_set_max_active() helper for thaw_workqueues() workqueue: remove @delayed from cwq_dec_nr_in_flight() workqueue: fix possible stall on try_to_grab_pending() of a delayed work item workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback() workqueue: use __cpuinit instead of __devinit for cpu callbacks workqueue: rename manager_mutex to assoc_mutex workqueue: WORKER_REBIND is no longer necessary for idle rebinding workqueue: WORKER_REBIND is no longer necessary for busy rebinding workqueue: reimplement idle worker rebinding workqueue: deprecate __cancel_delayed_work() workqueue: reimplement cancel_delayed_work() using try_to_grab_pending() workqueue: use mod_delayed_work() instead of __cancel + queue workqueue: use irqsafe timer for delayed_work workqueue: clean up delayed_work initializers and add missing one workqueue: make deferrable delayed_work initializer names consistent workqueue: cosmetic whitespace updates for macro definitions workqueue: deprecate system_nrt[_freezable]_wq workqueue: deprecate flush[_delayed]_work_sync() ...	2012-10-02 09:54:49 -07:00
Chuck Lever	c2ccc084eb	NFS: nfs41_walk_client_list(): re-lock before iterating Sparse identified an execution path in nfs41_walk_client_list() where the nfs_client_lock is not re-acquired before taking the next loop iteration. fs/nfs/nfs4client.c:437:9: sparse: context imbalance in 'nfs41_walk_client_list' - different lock contexts for basic block Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-02 09:25:02 -07:00
Trond Myklebust	ee314c2a35	NFSv4.1: Handle BAD_STATEID and EXPIRED errors in layoutget If the layoutget call returns a stateid error, we want to invalidate the layout stateid, and/or recover the open stateid. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-02 08:34:29 -07:00
Trond Myklebust	6f018efac1	NFSv4.1: bl_pg_init_write should be static Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-02 08:34:29 -07:00
Stanislav Kinsbursky	4e437e95ae	nfs: include internal.h in getroot.h Sparse warning: fs/nfs/nfs4getroot.c:11:5: warning: symbol 'nfs4_get_rootfh' was not declared. Should it be static? Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-02 08:17:04 -07:00
Stanislav Kinsbursky	22e2430961	nfs: include nfs4_fh.h in nfs4sysctl.c Sparse warnings: fs/nfs/nfs4sysctl.c:56:5: warning: symbol 'nfs4_register_sysctl' was not declared. Should it be static? fs/nfs/nfs4sysctl.c:64:6: warning: symbol 'nfs4_unregister_sysctl' was not declared. Should it be static? Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-02 08:17:03 -07:00
Stanislav Kinsbursky	3dd4f8ef7b	nfs: declare nfs_xdev_mount as static Sparse warning: fs/nfs/super.c:2517:15: warning: symbol 'nfs_xdev_mount' was not declared. Should it be static? Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-02 08:17:03 -07:00
Stanislav Kinsbursky	0b37d20ca2	nfs: declare nfs_callback_tcp_port in header Sparse warning: fs/nfs/super.c:2638:16: sparse: symbol 'nfs_callback_tcpport' was not declared. Should it be static? Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-02 08:17:02 -07:00
Stanislav Kinsbursky	ca57ccc48f	nfs: include NFSv4 header in netns.h Build error: fs/nfs/netns.h:27:15: error: 'NFS4_MAX_MINOR_VERSION' undeclared here (not in a function) Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-02 08:17:02 -07:00
Andy Adamson	47b803c8d2	NFSv4.0 reclaim reboot state when re-establishing clientid We should reclaim reboot state when the clientid is stale. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 18:12:41 -07:00
Trond Myklebust	f9d640f3a4	NFSv4: nfs4_match_clientids is only used by NFSv4.1 Fix another compiler warning. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 16:58:48 -07:00
Trond Myklebust	758201e2c9	NFSv4: Fix the minor version callback channel startup The current spaghetti code confuses some versions of gcc (and just looks ugly as hell)! Clean up... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 16:58:39 -07:00
Trond Myklebust	9f62387d6e	NFSv4: Fix up a merge conflict between migration and container changes nfs_callback_tcpport is now per-net_namespace. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 16:17:31 -07:00
Trond Myklebust	2afdfa5a84	Merge branch 'bugfixes' into nfs-for-next	2012-10-01 15:42:14 -07:00
Daniel Walter	7297cb682a	nfs: replace strict_strto* with kstrto* [nfs] replace strict_str* with kstr* variants * replace string conversions with newer kstr* functions Signed-off-by: Daniel Walter <sahne@0x90.at> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:40:05 -07:00
Yanchuan Nian	ee34e13620	NFS: Remove unnecessary semicolons (fs/nfs/client.c) There are some unnecessary semicolons in function find_nfs_version. Just remove them. Signed-off-by: Yanchuan Nian <ycnian@gmail.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:40:03 -07:00
Wei Yongjun	4e266229db	pnfsblock: use list_move_tail instead of list_del/list_add_tail Using list_move_tail() instead of list_del() + list_add_tail(). spatch with a semantic match is used to found this problem. (http://coccinelle.lip6.fr/) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:39:05 -07:00
Peng Tao	96c9eae638	pnfsblock: fix non-aligned DIO write For DIO writes, if it is not blocksize aligned, we need to do internal serialization. It may slow down writers anyway. So we just bail them out and resend to MDS. Cc: stable <stable@vger.kernel.org> [since v3.4] Signed-off-by: Peng Tao <tao.peng@emc.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:38:35 -07:00
Peng Tao	f742dc4a32	pnfsblock: fix non-aligned DIO read For DIO read, if it is not sector aligned, we should reject it and resend via MDS. Otherwise there might be data corruption. Also teach bl_read_pagelist to handle partial page reads for DIO. Cc: stable <stable@vger.kernel.org> [since v3.4] Signed-off-by: Peng Tao <tao.peng@emc.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:38:29 -07:00
Peng Tao	fe6e1e8d9f	pnfsblock: fix partial page buffer wirte If applications use flock to protect its write range, generic NFS will not do read-modify-write cycle at page cache level. Therefore LD should know how to handle non-sector aligned writes. Otherwise there will be data corruption. Cc: stable <stable@vger.kernel.org> Signed-off-by: Peng Tao <tao.peng@emc.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:38:24 -07:00
Peng Tao	5d0e3a004f	Revert "pnfsblock: bail out partial page IO" This reverts commit `159e0561e3`, in favor of a more complete fix to the alignment issue. Signed-off-by: Peng Tao <tao.peng@emc.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:38:16 -07:00
Peng Tao	dc182549d4	NFS41: fix error of setting blocklayoutdriver After commit `e38eb650` (NFS: set_pnfs_layoutdriver() from nfs4_proc_fsinfo()), set_pnfs_layoutdriver() is called inside nfs4_proc_fsinfo(), but pnfs_blksize is not set. It causes setting blocklayoutdriver failure and pnfsblock mount failure. Cc: stable <stable@vger.kernel.org> [since v3.5] Signed-off-by: Peng Tao <tao.peng@emc.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:37:39 -07:00
Peng Tao	7acdb02681	NFSv41: fix DIO write_io calculation pnfs_within_mdsthreshold() is called inside pg_init. We need to set read_io/write_io before that. Otherwise we fail pnfs_within_mdsthreshold() and IO goes to MDS. A simple test case: dd if=foo of=/mnt/pnfs/bar bs=10M count=1 oflag=direct Signed-off-by: Peng Tao <tao.peng@emc.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:37:34 -07:00
Chuck Lever	6f2ea7f2a3	NFS: Add nfs4_unique_id boot parameter An optional boot parameter is introduced to allow client administrators to specify a string that the Linux NFS client can insert into its nfs_client_id4 id string, to make it both more globally unique, and to ensure that it doesn't change even if the client's nodename changes. If this boot parameter is not specified, the client's nodename is used, as before. Client installation procedures can create a unique string (typically, a UUID) which remains unchanged during the lifetime of that client instance. This works just like creating a UUID for the label of the system's root and boot volumes. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:33:33 -07:00
Chuck Lever	05f4c350ee	NFS: Discover NFSv4 server trunking when mounting "Server trunking" is a fancy named for a multi-homed NFS server. Trunking might occur if a client sends NFS requests for a single workload to multiple network interfaces on the same server. There are some implications for NFSv4 state management that make it useful for a client to know if a single NFSv4 server instance is multi-homed. (Note this is only a consideration for NFSv4, not for legacy versions of NFS, which are stateless). If a client cares about server trunking, no NFSv4 operations can proceed until that client determines who it is talking to. Thus server IP trunking discovery must be done when the client first encounters an unfamiliar server IP address. The nfs_get_client() function walks the nfs_client_list and matches on server IP address. The outcome of that walk tells us immediately if we have an unfamiliar server IP address. It invokes nfs_init_client() in this case. Thus, nfs4_init_client() is a good spot to perform trunking discovery. Discovery requires a client to establish a fresh client ID, so our client will now send SETCLIENTID or EXCHANGE_ID as the first NFS operation after a successful ping, rather than waiting for an application to perform an operation that requires NFSv4 state. The exact process for detecting trunking is different for NFSv4.0 and NFSv4.1, so a minorversion-specific init_client callout method is introduced. CLID_INUSE recovery is important for the trunking discovery process. CLID_INUSE is a sign the server recognizes the client's nfs_client_id4 id string, but the client is using the wrong principal this time for the SETCLIENTID operation. The SETCLIENTID must be retried with a series of different principals until one works, and then the rest of trunking discovery can proceed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:33:33 -07:00
Chuck Lever	e984a55a74	NFS: Use the same nfs_client_id4 for every server Currently, when identifying itself to NFS servers, the Linux NFS client uses a unique nfs_client_id4.id string for each server IP address it talks with. For example, when client A talks to server X, the client identifies itself using a string like "AX". The requirements for these strings are specified in detail by RFC 3530 (and bis). This form of client identification presents a problem for Transparent State Migration. When client A's state on server X is migrated to server Y, it continues to be associated with string "AX." But, according to the rules of client string construction above, client A will present string "AY" when communicating with server Y. Server Y thus has no way to know that client A should be associated with the state migrated from server X. "AX" is all but abandoned, interfering with establishing fresh state for client A on server Y. To support transparent state migration, then, NFSv4.0 clients must instead use the same nfs_client_id4.id string to identify themselves to every NFS server; something like "A". Now a client identifies itself as "A" to server X. When a file system on server X transitions to server Y, and client A identifies itself as "A" to server Y, Y will know immediately that the state associated with "A," whether it is native or migrated, is owned by the client, and can merge both into a single lease. As a pre-requisite to adding support for NFSv4 migration to the Linux NFS client, this patch changes the way Linux identifies itself to NFS servers via the SETCLIENTID (NFSv4 minor version 0) and EXCHANGE_ID (NFSv4 minor version 1) operations. In addition to removing the server's IP address from nfs_client_id4, the Linux NFS client will also no longer use its own source IP address as part of the nfs_client_id4 string. On multi-homed clients, the value of this address depends on the address family and network routing used to contact the server, thus it can be different for each server. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:33:33 -07:00
Chuck Lever	896526174c	NFS: Introduce "migration" mount option Currently, the Linux client uses a unique nfs_client_id4.id string when identifying itself to distinct NFS servers. To support transparent state migration, the Linux client will have to use the same nfs_client_id4 string for all servers it communicates with (also known as the "uniform client string" approach). Otherwise NFS servers can not recognize that open and lock state need to be merged after a file system transition. Unfortunately, there are some NFSv4.0 servers currently in the field that do not tolerate the uniform client string approach. Thus, by default, our NFSv4.0 mounts will continue to use the current approach, and we introduce a mount option that switches them to use the uniform model. Client administrators must identify which servers can be mounted with this option. Eventually most NFSv4.0 servers will be able to handle the uniform approach, and we can change the default. The first mount of a server controls the behavior for all subsequent mounts for the lifetime of that set of mounts of that server. After the last mount of that server is gone, the client erases the data structure that tracks the lease. A subsequent lease may then honor a different "migration" setting. This patch adds only the infrastructure for parsing the new mount option. Support for uniform client strings is added in a subsequent patch. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:33:33 -07:00
Chuck Lever	ba9b584c1d	SUNRPC: Introduce rpc_clone_client_set_auth() An ULP is supposed to be able to replace a GSS rpc_auth object with another GSS rpc_auth object using rpcauth_create(). However, rpcauth_create() in 3.5 reliably fails with -EEXIST in this case. This is because when gss_create() attempts to create the upcall pipes, sometimes they are already there. For example if a pipe FS mount event occurs, or a previous GSS flavor was in use for this rpc_clnt. It turns out that's not the only problem here. While working on a fix for the above problem, we noticed that replacing an rpc_clnt's rpc_auth is not safe, since dereferencing the cl_auth field is not protected in any way. So we're deprecating the ability of rpcauth_create() to switch an rpc_clnt's security flavor during normal operation. Instead, let's add a fresh API that clones an rpc_clnt and gives the clone a new flavor before it's used. This makes immediate use of the new __rpc_clone_client() helper. This can be used in a similar fashion to rpcauth_create() when a client is hunting for the correct security flavor. Instead of replacing an rpc_clnt's security flavor in a loop, the ULP replaces the whole rpc_clnt. To fix the -EEXIST problem, any ULP logic that relies on replacing an rpc_clnt's rpc_auth with rpcauth_create() must be changed to use this API instead. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:33:33 -07:00
Chuck Lever	ffe5a83005	NFS: Slow down state manager after an unhandled error If the state manager thread is not actually able to fully recover from some situation, it wakes up waiters, who kick off a new state manager thread. Quite often the fresh invocation of the state manager is just as successful. This results in a livelock as the client dumps thousands of NFS requests a second on the network in a vain attempt to recover. Not very friendly. To mitigate this situation, add a delay in the state manager after an unhandled error, so that the client sends just a few requests every second in this case. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:31:51 -07:00
Chuck Lever	8cb7f74eee	NFS: nfs_parsed_mount_options can use unsigned int fs/nfs/super.c: In function ‘nfs_compare_remount_data’: fs/nfs/super.c:2042:18: warning: comparison between signed and unsigned integer expressions [-Wsign-compare] fs/nfs/super.c:2043:18: warning: comparison between signed and unsigned integer expressions [-Wsign-compare] fs/nfs/super.c:2044:20: warning: comparison between signed and unsigned integer expressions [-Wsign-compare] fs/nfs/super.c:2046:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare] fs/nfs/super.c:2047:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare] fs/nfs/super.c:2048:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare] fs/nfs/super.c:2049:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare] fs/nfs/super.c:2050:18: warning: comparison between signed and unsigned integer expressions [-Wsign-compare] Seen with gcc (GCC) 4.6.3 20120306 (Red Hat 4.6.3-2). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:31:41 -07:00
Stanislav Kinsbursky	cb7323fffa	lockd: create and use per-net NSM RPC clients on MON/UNMON requests NSM RPC client can be required on NFSv3 umount, when child reaper is dying (and destroying it's mount namespace). It means, that current nsproxy is set to NULL already, but creation of RPC client requires UTS namespace for gaining hostname string. This patch creates reference-counted per-net NSM client on first monitor request and destroys it after last unmonitor request. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Cc: <stable@vger.kernel.org> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:27:43 -07:00
Stanislav Kinsbursky	303a7ce920	lockd: use rpc client's cl_nodename for id encoding Taking hostname from uts namespace if not safe, because this cuold be performind during umount operation on child reaper death. And in this case current->nsproxy is NULL already. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Cc: <stable@vger.kernel.org> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:27:38 -07:00
Linus Torvalds	797b9e5ae9	Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6 Pull CIFS updates from Steve French: "This patchset is the final section of the SMB2.1 support merge for cifs.ko. It also includes improvements to the cifs socket handling from Jeff, and also fixes a few cifs bug fixes. It adds SMB2 support for file and inode operations as well as moves some existing cifs code to use ops server struct of protocol specific callbacks. Most of this code is SMB2 specific. When enabled SMB2.1 does pass various functional tests including most of the connectathon test suite, For SMB2.1, Connectathon test 4 and some related tests fail due to not updating mode bits remotely (cifsacl support where mode bits are approximated with the cifs acl is not enable for smb2), and test8 (symlink) support is not completed for SMB2 yet (note that we will likely have a "Unix Extensions" eventually, at least for Samba, so in the long run posix locks won't have to be emulated when mounting Linux to Linux, but for most NAS and for Windows mounts posix lock emulation will still used for SMB2 in a similar fashion as we do for cifs). SMB2.1 dialect is supported. Although additional fixes to enable smb2 (the original smb2.02) dialect and to add various optional features of the smb3 dialect are expected to be added in the future as testing progresses, currently mounting with the "vers=2.1" is supported (in order to mount using SMB2.1 to servers like Samba 4, and Windows 7, Windows 2008R2)." * 'for-linus' of git://git.samba.org/sfrench/cifs-2.6: (82 commits) [CIFS] Fix indentation of fs/cifs/Kconfig entries [CIFS] Fix SMB2 negotiation support to select only one dialect (based on vers=) cifs: obtain file access during backup intent lookup (resend) CIFS: Fix possible freed pointer dereference in CIFS_SessSetup CIFS: Fix possible freed pointer dereference in SMB2_sess_setup CIFS: Make ops->close return void cifs: change DOS/NT/POSIX mapping of ERRnoresource cifs: remove support for deprecated "forcedirectio" and "strictcache" mount options cifs: remove support for CIFS_IOC_CHECKUMOUNT ioctl CIFS: Fix possible memory leaks in SMB2 code CIFS: Fix endian conversion of IndexNumber Trivial endian fixes MARK SMB2 support EXPERIMENTAL Update cifs version number cifs: add FL_CLOSE to fl_flags mask in cifs_read_flock cifs: Mangle string used for unc in /proc/mounts cifs: cleanups for cifs_mkdir_qinfo CIFS: Fix fast lease break after open problem CIFS: Add SMB2.1 lease break support CIFS: Fix cache coherency for read oplock case ...	2012-10-01 15:27:35 -07:00
Stanislav Kinsbursky	e9406db20f	lockd: per-net NSM client creation and destruction helpers introduced NSM RPC client can be required on NFSv3 umount, when child reaper is dying (and destroying it's mount namespace). It means, that current nsproxy is set to NULL already, but creation of RPC client requires UTS namespace for gaining hostname string. This patch introduces reference counted NFS RPC clients creation and destruction helpers (similar to RPCBIND RPC clients). Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Cc: <stable@vger.kernel.org> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:27:34 -07:00
Stanislav Kinsbursky	1dc42e04b7	NFS: add debug messages to callback down function Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:26:06 -07:00
Stanislav Kinsbursky	b3d19c5172	NFS: callback per-net usage counting introduced This patch also introduces refcount-aware nfs_callback_down_net() wrapper for svc_shutdown_net(). Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:25:57 -07:00
Stanislav Kinsbursky	29dcc16a8e	NFS: make nfs_callback_tcpport6 per network context Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:25:51 -07:00
Stanislav Kinsbursky	bbe0a3aa4e	NFS: make nfs_callback_tcpport per network context Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:25:47 -07:00
Stanislav Kinsbursky	23c20ecd44	NFS: callback up - users counting cleanup Usage coutner now increased only is the service was started sccessfully. Even if service is running already, then goto is not required anymore, because service creation and start will be skipped. With this patch code looks clearer. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:25:38 -07:00
Stanislav Kinsbursky	8e24614443	NFS: callback service start function introduced This is just a code move, which from my POW makes code looks better. I.e. now on start we have 3 different stages: 1) Service creation. 2) Service per-net data allocation. 3) Service start. Patch also renames goto label "out_err:" into "err_start:" to reflect new changes. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:25:35 -07:00
Stanislav Kinsbursky	691c457ae6	NFS: callback up - transport backchannel cleanup No need to assign transports backchannel server explicitly in nfs41_callback_up() - there is nfs_callback_bc_serv() function for this. By using it, nfs4_callback_up() and nfs41_callback_up() can be called without transport argument. Note: service have to be passed to nfs_callback_bc_serv() instead of callback, since callback link can be uninitialized. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:25:29 -07:00
Stanislav Kinsbursky	c946556b87	NFS: move per-net callback thread initialization to nfs_callback_up_net() v4: 1) Callback transport creation routine selection by version simlified. This new function in now called before nfs_minorversion_callback_svc_setup()). Also few small changes: 1) current network namespace in nfs_callback_up() was replaced by transport net. 2) svc_shutdown_net() was moved prior to callback usage counter decrement (because in case of per-net data allocation faulure svc_shutdown_net() have to be skipped). Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:25:24 -07:00
Stanislav Kinsbursky	dd018428dc	NFS: callback service creation function introduced This function creates service if it's not exist, or increase usage counter of the existent, and returns pointer to it. Usage counter will be droppepd by svc_destroy() later in nfs_callback_up(). Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:25:11 -07:00
Stanislav Kinsbursky	c8ceb4124b	NFS: pass net to nfs_callback_down() Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:24:51 -07:00
Weston Andros Adamson	6168f62cbd	NFSv4: Add ACCESS operation to OPEN compound The OPEN operation has no way to differentiate an open for read and an open for execution - both look like read to the server. This allowed users to read files that didn't have READ access but did have EXEC access, which is obviously wrong. This patch adds an ACCESS call to the OPEN compound to handle the difference between OPENs for reading and execution. Since we're going through the trouble of calling ACCESS, we check all possible access bits and cache the results hopefully avoiding an ACCESS call in the future. Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:20:11 -07:00
Sage Weil	6816282dab	ceph: propagate layout error on osd request creation If we are creating an osd request and get an invalid layout, return an EINVAL to the caller. We switch up the return to have an error code instead of NULL implying -ENOMEM. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-10-01 17:20:00 -05:00
Bryan Schumaker	57a51048da	NFS: Use kzalloc() instead of kmalloc() in the idmapper This will allocate memory that has already been zeroed, allowing us to remove the memset later on. Signed-off-by: Bryan Schumaker <bjchuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:18:44 -07:00
Bryan Schumaker	6938867edb	NFS: Remove bad delegations during open recovery I put the client into an open recovery loop by: Client: Open file read half Server: Expire client (echo 0 > /sys/kernel/debug/nfsd/forget_clients) Client: Drop vm cache (echo 3 > /proc/sys/vm/drop_caches) finish reading file This causes a loop because the client never updates the nfs4_state after discovering that the delegation is invalid. This means it will keep trying to read using the bad delegation rather than attempting to re-open the file. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> CC: stable@vger.kernel.org [3.4+] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:17:25 -07:00
Bryan Schumaker	fcb6d9c6b7	NFS: Always use the open stateid when checking for expired opens If we are reading through a delegation, and the delegation is OK then state->stateid will still point to a delegation stateid and not an open stateid. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-10-01 15:17:17 -07:00
J. Bruce Fields	0d22f68f02	nfsd4: don't allow reclaims of expired clients When a confirmed client expires, we normally also need to expire any stable storage record which would allow that client to reclaim state on the next boot. We forgot to do this in some cases. (For example, in destroy_clientid, and in the cases in exchange_id and create_session that destroy and existing confirmed client.) But in most other cases, there's really no harm to calling nfsd4_client_record_remove(), because it is a no-op in the case the client doesn't have an existing The single exception is destroying a client on shutdown, when we want to keep the stable storage records so we can recognize which clients will be allowed to reclaim when we come back up. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:40:04 -04:00
J. Bruce Fields	6a3b156342	nfsd4: remove redundant callback probe Both nfsd4_init_conn and alloc_init_session are probing the callback channel, harmless but pointless. Also, nfsd4_init_conn should probably be probing in the "unknown" case as well. In fact I don't see any harm to just doing it unconditionally when we get a new backchannel connection. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:40:03 -04:00
J. Bruce Fields	8f9d3d3b7c	nfsd4: expire old client earlier Before we had to delay expiring a client till we'd found out whether the session and connection allocations would succeed. That's no longer necessary. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:40:03 -04:00
J. Bruce Fields	81f0b2a496	nfsd4: separate session allocation and initialization This will allow some further simplification. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:40:02 -04:00
J. Bruce Fields	a827bcb242	nfsd4: clean up session allocation Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:40:01 -04:00
J. Bruce Fields	1377b69e68	nfsd4: minor free_session cleanup Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:40:00 -04:00
J. Bruce Fields	e1ff371f9d	nfsd4: new_conn_from_crses should only allocate Do the initialization in the caller, and clarify that the only failure ever possible here was due to allocation. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:40:00 -04:00
J. Bruce Fields	3ba6367124	nfsd4: separate connection allocation and initialization It'll be useful to have connection allocation and initialization as separate functions. Also, note we'd been ignoring the alloc_conn error return in bind_conn_to_session. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:39:59 -04:00
J. Bruce Fields	4973050148	nfsd4: reject bad forechannel attrs earlier This could simplify the logic a little later. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:39:58 -04:00
J. Bruce Fields	d15c077e44	nfsd4: enforce per-client sessions/no-sessions distinction Something like creating a client with setclientid and then trying to confirm it with create_session may not crash the server, but I'm not completely positive of that, and in any case it's obviously bad client behavior. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:39:58 -04:00
J. Bruce Fields	c116a0af76	nfsd4: set cl_minorversion at create time And remove some mostly obsolete comments. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:39:57 -04:00
J. Bruce Fields	68eb35081e	nfsd4: don't pin clientids to pseudoflavors I added cr_flavor to the data compared in same_creds without any justification, in `d5497fc693` "nfsd4: move rq_flavor into svc_cred". Recent client changes then started making mount -osec=krb5 server:/export /mnt/ echo "hello" >/mnt/TMP umount /mnt/ mount -osec=krb5i server:/export /mnt/ echo "hello" >/mnt/TMP to fail due to a clid_inuse on the second open. Mounting sequentially like this with different flavors probably isn't that common outside artificial tests. Also, the real bug here may be that the server isn't just destroying the former clientid in this case (because it isn't good enough at recognizing when the old state is gone). But it prompted some discussion and a look back at the spec, and I think the check was probably wrong. Fix and document. Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-10-01 17:39:14 -04:00
Dmitry Torokhov	dde3ada3d0	Merge branch 'next' into for-linus Prepare first set of updates for 3.7 merge window.	2012-10-01 14:20:58 -07:00
Linus Torvalds	40689ac479	dlm for 3.7 There are two main patches in this set, both related to the userland dlm_controld daemon. The first fixes a deadlock between dlm_controld and the dlm_send workqueue when both access configfs data simultaneously. The second reworks some code to get around a long standing, but intentional, unlock balance warning. The userland daemon no longer takes a lock that is later released from the kernel. The other commits are minor fixes and changes. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABAgAGBQJQaeyUAAoJEDgbc8f8gGmqokAQALIt28LyaIAh9bfc0AWyLldM t81sT2gDR/uF0r2sLHdevunCecUbxR8CN5i+pxbTf1OsDkrhYQDx7GtQvTUefc2M /+JA/8UosZkrsTC0LvayWbsbC15c8epwxmERel/feinZ7dmgqNGBLaMPGJzZMv5/ KvHPRMg/JSJOpnwC64xZ097RsNkX5P8lwnK7U5ivs4vLCYR3ac3+V2+t/hxAMONE dwt7EZ/ADCBMUkQDTgisJA/Xr4yub+tTxUMshwkX+ve8nzSdj8dPzMFfky0bSady +x/r3GSVScQ//qqp8a01Jb2ZeFhKMfpTcPWve4chsngszsSzElJeMbw7numVuU98 7hCUbH5MeSuQ2Nx3qYfi1LhrV0687lbkNLlqET+3PX/r6m6T4SqucOmPv2SeSet9 rLGYNM1hJxx1rPASVtEsE/Xhr1agI58EO9tNb9p79wS4X5fcGqUsQXv4gRnu2/LN GRUTADC7nFDXy6BItYDwsuQfuJjMaDIsZ49dEUXYTlzav/zYK4IpU3fGuG3z/UDZ pQl+g0aK7rsq5hk7jxGfdHDBvhM+kcRMeQuTJqxJuJvLs0pmLWvFJXbyn+xh7y9Y MZYSSI4L3j1s4u6SkGUsMRezfcfJEYQdOzlIs7HEaLG++yBgcScF70ZgJByxzmfr eb9tpzwQvGLY4CvUVbEi =Ghb3 -----END PGP SIGNATURE----- Merge tag 'dlm-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: "There are two main patches in this set, both related to the userland dlm_controld daemon. The first fixes a deadlock between dlm_controld and the dlm_send workqueue when both access configfs data simultaneously. The second reworks some code to get around a long standing, but intentional, unlock balance warning. The userland daemon no longer takes a lock that is later released from the kernel. The other commits are minor fixes and changes." * tag 'dlm-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: check the maximum size of a request from user dlm: cleanup send_to_sock routine dlm: convert add_sock routine return value type to void dlm: remove redundant variable assignments dlm: fix unlock balance warnings dlm: fix uninitialized spinlock dlm: fix deadlock between dlm_send and dlm_controld	2012-10-01 13:51:58 -07:00
Wei Yongjun	b905a7f8b7	ceph: convert to use le32_add_cpu() Convert cpu_to_le32(le32_to_cpu(E1) + E2) to use le32_add_cpu(). dpatch engine is used to auto generate this patch. (https://github.com/weiyj/dpatch) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Sage Weil <sage@inktank.com>	2012-10-01 14:30:54 -05:00
Yan, Zheng	3e8f43a089	ceph: Fix oops when handling mdsmap that decreases max_mds When i >= newmap->m_max_mds, ceph_mdsmap_get_addr(newmap, i) return NULL. Passing NULL to memcmp() triggers oops. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Sage Weil <sage@inktank.com>	2012-10-01 14:30:54 -05:00
Alex Elder	c98f533c94	ceph: let path portion of mount "device" be optional A recent change to /sbin/mountall causes any trailing '/' character in the "device" (or fs_spec) field in /etc/fstab to be stripped. As a result, an entry for a ceph mount that intends to mount the root of the name space ends up with now path portion, and the ceph mount option processing code rejects this. That is, an entry in /etc/fstab like: cephserver:port:/ /mnt ceph defaults 0 0 provides to the ceph code just "cephserver:port:" as the "device," and that gets rejected. Although this is a bug in /sbin/mountall, we can have the ceph mount code support an empty/nonexistent path, interpreting it to mean the root of the name space. RFC 5952 offers recommendations for how to express IPv6 addresses, and recommends the usage found in RFC 3986 (which specifies the format for URI's) for representing both IPv4 and IPv6 addresses that include port numbers. (See in particular the definition of "authority" found in the Appendix of RFC 3986.) According to those standards, no host specification will ever contain a '/' character. As a result, it is sufficient to scan a provided "device" from an /etc/fstab entry for the first '/' character, and if it's found, treat that as the beginning of the path. If no '/' character is present, we can treat the entire string as the monitor host specification(s), and assume the path to be the root of the name space. We'll still require a ':' to separate the host portion from the (possibly empty) path portion. This means that we can more formally define how ceph will interpret the "device" it's provided when processing a mount request: "device" will look like: <server_spec>[,<server_spec>...]:[<path>] where <server_spec> is <ip>[:<port>] <path> is optional, but if present must begin with '/' This addresses http://tracker.newdream.net/issues/2919 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>	2012-10-01 14:30:49 -05:00
Linus Torvalds	3498d13b80	TTY merge for 3.7-rc1 As we skipped the merge window for 3.6-rc1 for the tty tree, everything is now settled down and working properly, so we are ready for 3.7-rc1. Here's the patchset, it's big, but the large changes are removing a firmware file and adding a staging tty driver (it depended on the tty core changes, so it's going through this tree instead of the staging tree.) All of these patches have been in the linux-next tree for a while. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iEYEABECAAYFAlBp36oACgkQMUfUDdst+yk4WgCdEy13hot8fI2Lqnc7W0LKu7GX 4p8AoLTjzrXhLosxdijskDQ9X1OtjrxU =S5Ng -----END PGP SIGNATURE----- Merge tag 'tty-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull TTY changes from Greg Kroah-Hartman: "As we skipped the merge window for 3.6-rc1 for the tty tree, everything is now settled down and working properly, so we are ready for 3.7-rc1. Here's the patchset, it's big, but the large changes are removing a firmware file and adding a staging tty driver (it depended on the tty core changes, so it's going through this tree instead of the staging tree.) All of these patches have been in the linux-next tree for a while. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>" Fix up more-or-less trivial conflicts in - drivers/char/pcmcia/synclink_cs.c: tty NULL dereference fix vs tty_port_cts_enabled() helper function - drivers/staging/{Kconfig,Makefile}: add-add conflict (dgrp driver added close to other staging drivers) - drivers/staging/ipack/devices/ipoctal.c: "split ipoctal_channel from iopctal" vs "TTY: use tty_port_register_device" * tag 'tty-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (235 commits) tty/serial: Add kgdb_nmi driver tty/serial/amba-pl011: Quiesce interrupts in poll_get_char tty/serial/amba-pl011: Implement poll_init callback tty/serial/core: Introduce poll_init callback kdb: Turn KGDB_KDB=n stubs into static inlines kdb: Implement disable_nmi command kernel/debug: Mask KGDB NMI upon entry serial: pl011: handle corruption at high clock speeds serial: sccnxp: Make 'default' choice in switch last serial: sccnxp: Remove mask termios caps for SW flow control serial: sccnxp: Report actual baudrate back to core serial: samsung: Add poll_get_char & poll_put_char Powerpc 8xx CPM_UART setting MAXIDL register proportionaly to baud rate Powerpc 8xx CPM_UART maxidl should not depend on fifo size Powerpc 8xx CPM_UART too many interrupts Powerpc 8xx CPM_UART desynchronisation serial: set correct baud_base for EXSYS EX-41092 Dual 16950 serial: omap: fix the reciever line error case 8250: blacklist Winbond CIR port 8250_pnp: do pnp probe before legacy probe ...	2012-10-01 12:26:52 -07:00
Miao Xie	90abccf2c6	Revert "Btrfs: do not do filemap_write_and_wait_range in fsync" This reverts commit `0885ef5b56` After applying the above patch, the performance slowed down because the dirty page flush can only be done by one task, so revert it. The following is the test result of sysbench: Before After 24MB/s 39MB/s Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:22 -04:00
Josef Bacik	698d0082c4	Btrfs: remove bytes argument from do_chunk_alloc Everybody is just making stuff up, and it's just used to see if we really do need to alloc a chunk, and since we do this when we already know we really do it's just a waste of space. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:21 -04:00
Josef Bacik	ea658badc4	Btrfs: delay block group item insertion So we have lots of places where we try to preallocate chunks in order to make sure we have enough space as we make our allocations. This has historically meant that we're constantly tweaking when we should allocate a new chunk, and historically we have gotten this horribly wrong so we way over allocate either metadata or data. To try and keep this from happening we are going to make it so that the block group item insertion is done out of band at the end of a transaction. This will allow us to create chunks even if we are trying to make an allocation for the extent tree. With this patch my enospc tests run faster (didn't expect this) and more efficiently use the disk space (this is what I wanted). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:21 -04:00
Kent Overstreet	be3940c0a9	btrfs: Kill some bi_idx references For immutable bio vecs, I've been auditing and removing bi_idx references. These were harmless, but removing them will make auditing easier. scrub_bio_end_io_worker() was open coding a bio_reset() - but this doesn't appear to have been needed for anything as right after it does a bio_put(), and perusing the code it doesn't appear anything else was holding a reference to the bio. The other use end_bio_extent_readpage() was just for a pr_debug() - changed it to something that might be a bit more useful. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Chris Mason <chris.mason@oracle.com> CC: Stefan Behrens <sbehrens@giantdisaster.de>	2012-10-01 15:19:21 -04:00
Miao Xie	962197babe	Btrfs: fix unnecessary warning when the fragments make the space alloc fail When we wrote some data by compress mode into a btrfs filesystem which was full of the fragments, the kernel will report: BTRFS warning (device xxx): Aborting unused transaction. The reason is: We can not find a long enough free space to store the compressed data because of the fragmentary free space, and the compressed data can not be splited, so the kernel outputed the above message. In fact, btrfs can deal with this problem very well: it fall back to uncompressed IO, split the uncompressed data into small ones, and then store them into to the fragmentary free space. So we shouldn't output the above warning message. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:20 -04:00
Josef Bacik	69ffb54347	Btrfs: create a pinned em when writing to a prealloc range in DIO Wade Cline reported a problem where he was getting garbage and warnings when writing to a preallocated range via O_DIRECT. This is because we weren't creating our normal pinned extent_map for the range we were writing to, which was causing all sorts of issues. This patch fixes the problem and makes his testcase much happier. Thanks, Reported-by: Wade Cline <clinew@linux.vnet.ibm.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:20 -04:00
Josef Bacik	6df7881a84	Btrfs: move the sb_end_intwrite until after the throttle logic Sage reported the following lockdep backtrace ===================================== [ BUG: bad unlock balance detected! ] 3.6.0-rc2-ceph-00171-gc7ed62d #1 Not tainted ------------------------------------- btrfs-cleaner/7607 is trying to release lock (sb_internal) at: [<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs] but there are no more locks to release! other info that might help us debug this: 1 lock held by btrfs-cleaner/7607: #0: (&fs_info->cleaner_mutex){+.+...}, at: [<ffffffffa003b405>] cleaner_kthread+0x95/0x120 [btrfs] stack backtrace: Pid: 7607, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00171-gc7ed62d #1 Call Trace: [<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs] [<ffffffff810afa9e>] print_unlock_inbalance_bug+0xfe/0x110 [<ffffffff810b289e>] lock_release_non_nested+0x1ee/0x310 [<ffffffff81172f9b>] ? kmem_cache_free+0x7b/0x160 [<ffffffffa004106c>] ? put_transaction+0x8c/0x130 [btrfs] [<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs] [<ffffffff810b2a95>] lock_release+0xd5/0x220 [<ffffffff81173071>] ? kmem_cache_free+0x151/0x160 [<ffffffff8117d9ed>] __sb_end_write+0x7d/0x90 [<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs] [<ffffffff81079850>] ? __init_waitqueue_head+0x60/0x60 [<ffffffff81634c6b>] ? _raw_spin_unlock+0x2b/0x40 [<ffffffffa0042758>] __btrfs_end_transaction+0x368/0x3c0 [btrfs] [<ffffffffa0042808>] btrfs_end_transaction_throttle+0x18/0x20 [btrfs] [<ffffffffa00318f0>] btrfs_drop_snapshot+0x410/0x600 [btrfs] [<ffffffff8132babd>] ? do_raw_spin_unlock+0x5d/0xb0 [<ffffffffa00430ef>] btrfs_clean_old_snapshots+0xaf/0x150 [btrfs] [<ffffffffa003b405>] ? cleaner_kthread+0x95/0x120 [btrfs] [<ffffffffa003b419>] cleaner_kthread+0xa9/0x120 [btrfs] [<ffffffffa003b370>] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs] [<ffffffff810791ee>] kthread+0xae/0xc0 [<ffffffff810b379d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10 [<ffffffff81635430>] ? retint_restore_args+0x13/0x13 [<ffffffff81079140>] ? flush_kthread_work+0x1a0/0x1a0 [<ffffffff8163e740>] ? gs_change+0x13/0x13 This is because the throttle stuff can commit the transaction, which expects to be the one stopping the intwrite stuff, but we've already done it in the __btrfs_end_transaction. Moving the sb_end_intewrite after this logic makes the lockdep go away. Thanks, Tested-by: Sage Weil <sage@inktank.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:19 -04:00
Liu Bo	425d17a290	Btrfs: use larger limit for translation of logical to inode This is the change of the kernel side. Translation of logical to inode used to have an upper limit 4k on inode container's size, but the limit is not large enough for a data with a great many of refs, so when resolving logical address, we can end up with "ioctl ret=0, bytes_left=0, bytes_missing=19944, cnt=510, missed=2493" This changes to regard 64k as the upper limit and use vmalloc instead of kmalloc to get memory more easily. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:19 -04:00
Liu Bo	df031f0752	Btrfs: use helper for logical resolve We already have a helper, iterate_inodes_from_logical(), for logical resolve, so just use it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:18 -04:00
Liu Bo	69917e4312	Btrfs: fix a bug in parsing return value in logical resolve In logical resolve, we parse extent_from_logical()'s 'ret' as a kind of flag. It is possible to lose our errors because (-EXXXX & BTRFS_EXTENT_FLAG_TREE_BLOCK) is true. I'm not sure if it is on purpose, it just looks too hacky if it is. I'd rather use a real flag and a 'ret' to catch errors. Acked-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Liu Bo <liub.liubo@gmail.com>	2012-10-01 15:19:18 -04:00
liubo	0647d6bd16	Btrfs: cleanup for unused ref cache stuff As ref cache has been removed from btrfs, there is no user on its lock and its check. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:17 -04:00
Miao Xie	8407aa4643	Btrfs: fix corrupted metadata in the snapshot When we delete a inode, we will remove all the delayed items including delayed inode update, and then truncate all the relative metadata. If there is lots of metadata, we will end the current transaction, and start a new transaction to truncate the left metadata. In this way, we will leave a inode item that its link counter is > 0, and also may leave some directory index items in fs/file tree after the current transaction ends. In other words, the metadata in this fs/file tree is inconsistent. If we create a snapshot for this tree now, we will find a inode with corrupted metadata in the new snapshot, and we won't continue to drop the left metadata, because its link counter is not 0. We fix this problem by updating the inode item before the current transaction ends. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:17 -04:00
David Sterba	837e197283	btrfs: polish names of kmem caches Usecase: watch 'grep btrfs < /proc/slabinfo' easy to watch all caches in one go. Signed-off-by: David Sterba <dsterba@suse.cz>	2012-10-01 15:19:16 -04:00
Josef Bacik	a80c8dcf7e	Btrfs: fix our overcommit math I noticed I was seeing large lags when running my torrent test in a vm on my laptop. While trying to make it lag less I noticed that our overcommit math was taking into account the number of bytes we wanted to reclaim, not the number of bytes we actually wanted to allocate, which means we wouldn't overcommit as often. This patch fixes the overcommit math and makes shrink_delalloc() use that logic so that it will stop looping faster. We still have pretty high spikes of latency, but the test now takes 3 minutes less time (about 5% faster). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:16 -04:00
Josef Bacik	dea31f5233	Btrfs: wait on async pages when shrinking delalloc Mitch reported a problem where you could get an ENOSPC error when untarring a kernel git tree onto a 16gb file system with compress-force=zlib. This is because compression is a huge pain, it will return from ->writepages() without having actually created any ordered extents. To get around this we check to see if the async submit counter is up, and if it is wait until it drops to 0 before doing our normal ordered wait dance. With this patch I can now untar a kernel git tree onto a 16gb file system without getting ENOSPC errors. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:15 -04:00
Liu Bo	9e8a4a8b0b	Btrfs: use flag EXTENT_DEFRAG for snapshot-aware defrag We're going to use this flag EXTENT_DEFRAG to indicate which range belongs to defragment so that we can implement snapshow-aware defrag: We set the EXTENT_DEFRAG flag when dirtying the extents that need defragmented, so later on writeback thread can differentiate between normal writeback and writeback started by defragmentation. Original-Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:15 -04:00
Tsutomu Itoh	3d6b5c3b5c	Btrfs: check return value of ulist_alloc() properly ulist_alloc() has the possibility of returning NULL. So, it is necessary to check the return value. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>	2012-10-01 15:19:14 -04:00
Tsutomu Itoh	f54fb859da	Btrfs: fix error handling in delete_block_group_cache() btrfs_iget() never return NULL. So, NULL check is unnecessary. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>	2012-10-01 15:19:14 -04:00
Miao Xie	903889f462	Btrfs: fix wrong size for the reservation when doing, file pre-allocation. When we ran fsstress(a program in xfstests), the filesystem hung up when it is full. It was because the space reserved in btrfs_fallocate() was wrong, btrfs_fallocate() just used the size of the pre-allocation to reserve the space, didn't took the block size aligning into account, so the size of the reserved space was less than the allocated space, it caused the over reserve problem and made the filesystem hung up when invoking cow_file_range(). Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:14 -04:00
Miao Xie	69ce977a17	Btrfs: output more information when aborting a unused transaction handle Though we dump the stack information when aborting a unused transaction handle, we don't know the correct place where we decide to abort the transaction handle if one function has several place where the transaction abort function is invoked and jumps to the same place after this call. And beside that we also don't know the reason why we jump to abort the current handle. So I modify the transaction abort function and make it output the function name, line and error information. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:13 -04:00
Miao Xie	2ecb79239b	Btrfs: fix unprotected ->log_batch We forget to protect ->log_batch when syncing a file, this patch fix this problem by atomic operation. And ->log_batch is used to check if there are parallel sync operations or not, so it is unnecessary to reset it to 0 after the sync operation of the current log tree complete. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:12 -04:00
Miao Xie	48c03c4bcf	Btrfs: fix wrong size for the reservation of the, snapshot creation We should insert/update 6 items(root ref, root backref, dir item, dir index, root item and parent inode) when creating a snapshot, not 5 items, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:12 -04:00
Miao Xie	42874b3db7	Btrfs: fix the snapshot that should not exist The snapshot should be the image of the fs tree before it was created, so the metadata of the snapshot should not exist in the its tree. But now, we found the directory item and directory name index is in both the snapshot tree and the fs tree. It introduces some problems and makes the users feel strange: # mkfs.btrfs /dev/sda1 # mount /dev/sda1 /mnt # mkdir /mnt/1 # cd /mnt/1 # btrfs subvolume snapshot /mnt snap0 # ls -a /mnt/1/snap0/1 . .. [no other file/dir] # ll /mnt/1/snap0/ total 0 drwxr-xr-x 1 root root 10 Ju1 24 12:11 1 ^^^ There is no file/dir in it, but it's size is 10 # cd /mnt/1/snap0/1/snap0 [Enter a unexisted directory successfully...] There is nothing in the directory 1 in snap0, but btrfs told the length of this directory is 10. Beside that, we can enter an unexisted directory, it is very strange to the users. # btrfs subvolume snapshot /mnt/1/snap0 /mnt/snap1 # ll /mnt/1/snap0/1/ total 0 [None] # ll /mnt/snap1/1/ total 0 drwxr-xr-x 1 root root 0 Ju1 24 12:14 snap0 And the source of snap1 did have any directory in Directory 1, but snap1 have a snap0, it is different between the source and the snapshot. So I think we should insert directory item and directory name index and update the parent inode as the last step of snapshot creation, and do not leave the useless metadata in the file tree. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:12 -04:00
Miao Xie	66d8f3dd1c	Btrfs: add a new "type" field into the block reservation structure Sometimes we need choose the method of the reservation according to the type of the block reservation, such as the reservation for the delayed inode update. Now we identify the type just by comparing the address of the reservation variants, it is very ugly if it is a temporary one because we need compare it with all the common reservation variants. So we add a new "type" field to keep the type the reservation variants. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:11 -04:00
Miao Xie	6352b91da1	Btrfs: use a slab for ordered extents allocation The ordered extent allocation is in the fast path of the IO, so use a slab to improve the speed of the allocation. "Size of the struct is 280, so this will fall into the size-512 bucket, giving 8 objects per page, while own slab will pack 14 objects into a page. Another benefit I see is to check for leaked objects when the module is removed (and the cache destroy takes place)." -- David Sterba Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:11 -04:00
Miao Xie	b9a8cc5bef	Btrfs: fix file extent discount problem in the, snapshot If a snapshot is created while we are writing some data into the file, the i_size of the corresponding file in the snapshot will be wrong, it will be beyond the end of the last file extent. And btrfsck will report: root 256 inode 257 errors 100 Steps to reproduce: # mkfs.btrfs <partition> # mount <partition> <mnt> # cd <mnt> # dd if=/dev/zero of=tmpfile bs=4M count=1024 & # for ((i=0; i<4; i++)) > do > btrfs sub snap . $i > done This because the algorithm of disk_i_size update is wrong. Though there are some ordered extents behind the current one which we use to update disk_i_size, it doesn't mean those extents will be dealt with in the same transaction. So We shouldn't use the offset of those extents to update disk_i_size. Or we will get the wrong i_size in the snapshot. We fix this problem by recording the max real i_size. If we find there is a ordered extent which is in front of the current one and doesn't complete, we will record the end of the current one into that ordered extent. Surely, if the current extent holds the end of other extent(it must be greater than the current one because it is behind the current one), we will record the number that the current extent holds. In this way, we can exclude the ordered extents that may not be dealth with in the same transaction, and be easy to know the real disk_i_size. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:10 -04:00
Miao Xie	361048f586	Btrfs: fix full backref problem when inserting shared block reference If we create several snapshots at the same time, the following BUG_ON() will be triggered. kernel BUG at fs/btrfs/extent-tree.c:6047! Steps to reproduce: # mkfs.btrfs <partition> # mount <partition> <mnt> # cd <mnt> # for ((i=0;i<2400;i++)); do touch long_name_to_make_tree_more_deep$i; done # for ((i=0; i<4; i++)) > do > mkdir $i > for ((j=0; j<200; j++)) > do > btrfs sub snap . $i/$j > done & > done The reason is: Before transaction commit, some operations changed the fs tree and new tree blocks were allocated because of COW. We used the implicit non-shared back reference for those newly allocated tree blocks because they were not shared by two or more trees. And then we created the first snapshot for the fs tree, according to the back reference rules, we also used implicit back refs for the child tree blocks of the root node of the fs tree, now those child nodes/leaves were shared by two trees. Then We didn't deal with the delayed references, and continued to change the fs tree(created the second snapshot and inserted the dir item of the new snapshot into the fs tree). According to the rules of the back reference, we added full back refs for those tree blocks whose parents have be shared by two trees. Now some newly allocated tree blocks had two types of the references. As we know, the delayed reference system handles these delayed references from back to front, and the full delayed reference is inserted after the implicit ones. So when we dealt with the back references of those newly allocated tree blocks, the full references was dealt with at first. And if the first reference is a shared back reference and the tree block that the reference points to is newly allocated, It would be considered as a tree block which is shared by two or more trees when it is allocated and should be a full back reference not a implicit one, the flag of its reference also should be set to FULL_BACKREF. But in fact, it was a non-shared tree block with a implicit reference at beginning, so it was not compulsory to set the flags to FULL_BACKREF. So BUG_ON was triggered. We have several methods to fix this bug: 1. deal with delayed references after the snapshot is created and before we change the source tree of the snapshot. This is the easiest and safest way. 2. modify the sort method of the delayed reference tree, make the full delayed references be inserted before the implicit ones. It is also very easy, but I don't know if it will introduce some problems or not. 3. modify select_delayed_ref() and make it select the implicit delayed reference at first. This way is not so good because it may wastes CPU time if we have lots of delayed references. 4. set the flags to FULL_BACKREF, this method is a little complex comparing with the 1st way. I chose the 1st way to fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:10 -04:00
Miao Xie	6fa9700e73	Btrfs: fix error path in create_pending_snapshot() This patch fixes the following problem: - If we failed to deal with the delayed dir items, we should abort transaction, just as its comment said. Fix it. - If root reference or root back reference insertion failed, we should abort transaction. Fix it. - Fix the double free problem of pending->inherit. - Do not restore the trans->rsv if we doesn't change it. - make the error path more clearly. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:09 -04:00
Wei Yongjun	cf93dccea6	Btrfs: fix possible memory leak in scrub_setup_recheck_block() bbio has been malloced in btrfs_map_block() and should be freed before leaving from the error handling cases. spatch with a semantic match is used to found this problem. (http://coccinelle.lip6.fr/) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>	2012-10-01 15:19:09 -04:00
Josef Bacik	7014cdb493	Btrfs: btrfs_drop_extent_cache should never fail I noticed this when I was doing the fsync stuff, we allocate split extents if we drop an extent range that is in the middle of an existing extent. This BUG()'s if we fail to allocate memory, but the fact is this is just a cache, we will just regenerate the cache if we need it, the important part is that we free the range we are given. This can be done without allocations, so if we fail to allocate splits just skip the splitting stage and free our em and look for more extents to drop. This also makes btrfs_drop_extent_cache a void since nobody was checking the return value anyway. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:09 -04:00
Sage Weil	ac14aed665	Btrfs: do not take cleanup_work_sem in btrfs_run_delayed_iputs() Josef has suggested that this is not necessary. Removing it also avoids this lockdep splat (after the new sb_internal locking stuff was added): [ 604.090449] ====================================================== [ 604.114819] [ INFO: possible circular locking dependency detected ] [ 604.139262] 3.6.0-rc2-ceph-00144-g463b030 #1 Not tainted [ 604.162193] ------------------------------------------------------- [ 604.186139] btrfs-cleaner/6669 is trying to acquire lock: [ 604.209555] (sb_internal#2){.+.+..}, at: [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs] [ 604.257100] [ 604.257100] but task is already holding lock: [ 604.300366] (&fs_info->cleanup_work_sem){.+.+..}, at: [<ffffffffa0048002>] btrfs_run_delayed_iputs+0x72/0x130 [btrfs] [ 604.352989] [ 604.352989] which lock already depends on the new lock. [ 604.352989] [ 604.427104] [ 604.427104] the existing dependency chain (in reverse order) is: [ 604.478493] [ 604.478493] -> #1 (&fs_info->cleanup_work_sem){.+.+..}: [ 604.529313] [<ffffffff810b2c82>] lock_acquire+0xa2/0x140 [ 604.559621] [<ffffffff81632b69>] down_read+0x39/0x4e [ 604.589382] [<ffffffffa004db98>] btrfs_lookup_dentry+0x218/0x550 [btrfs] [ 604.596161] btrfs: unlinked 1 orphans [ 604.675002] [<ffffffffa006aadd>] create_subvol+0x62d/0x690 [btrfs] [ 604.708859] [<ffffffffa006d666>] btrfs_mksubvol.isra.52+0x346/0x3a0 [btrfs] [ 604.772466] [<ffffffffa006d7f2>] btrfs_ioctl_snap_create_transid+0x132/0x190 [btrfs] [ 604.842245] [<ffffffffa006d8ae>] btrfs_ioctl_snap_create+0x5e/0x80 [btrfs] [ 604.912852] [<ffffffffa00708ae>] btrfs_ioctl+0x138e/0x1990 [btrfs] [ 604.951888] [<ffffffff8118e9b8>] do_vfs_ioctl+0x98/0x560 [ 604.989961] [<ffffffff8118ef11>] sys_ioctl+0x91/0xa0 [ 605.026628] [<ffffffff8163d569>] system_call_fastpath+0x16/0x1b [ 605.064404] [ 605.064404] -> #0 (sb_internal#2){.+.+..}: [ 605.126832] [<ffffffff810b25e8>] __lock_acquire+0x1ac8/0x1b90 [ 605.163671] [<ffffffff810b2c82>] lock_acquire+0xa2/0x140 [ 605.200228] [<ffffffff8117dac6>] __sb_start_write+0xc6/0x1b0 [ 605.236818] [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs] [ 605.274029] [<ffffffffa00431a3>] btrfs_start_transaction+0x13/0x20 [btrfs] [ 605.340520] [<ffffffffa004ccfa>] btrfs_evict_inode+0x19a/0x330 [btrfs] [ 605.378720] [<ffffffff811972c8>] evict+0xb8/0x1c0 [ 605.416057] [<ffffffff811974d5>] iput+0x105/0x210 [ 605.452373] [<ffffffffa0048082>] btrfs_run_delayed_iputs+0xf2/0x130 [btrfs] [ 605.521627] [<ffffffffa003b5e1>] cleaner_kthread+0xa1/0x120 [btrfs] [ 605.560520] [<ffffffff810791ee>] kthread+0xae/0xc0 [ 605.598094] [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10 [ 605.636499] [ 605.636499] other info that might help us debug this: [ 605.636499] [ 605.736504] Possible unsafe locking scenario: [ 605.736504] [ 605.801931] CPU0 CPU1 [ 605.835126] ---- ---- [ 605.867093] lock(&fs_info->cleanup_work_sem); [ 605.898594] lock(sb_internal#2); [ 605.931954] lock(&fs_info->cleanup_work_sem); [ 605.965359] lock(sb_internal#2); [ 605.994758] [ 605.994758] * DEADLOCK * [ 605.994758] [ 606.075281] 2 locks held by btrfs-cleaner/6669: [ 606.104528] #0: (&fs_info->cleaner_mutex){+.+...}, at: [<ffffffffa003b5d5>] cleaner_kthread+0x95/0x120 [btrfs] [ 606.165626] #1: (&fs_info->cleanup_work_sem){.+.+..}, at: [<ffffffffa0048002>] btrfs_run_delayed_iputs+0x72/0x130 [btrfs] [ 606.231297] [ 606.231297] stack backtrace: [ 606.287723] Pid: 6669, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00144-g463b030 #1 [ 606.347823] Call Trace: [ 606.376184] [<ffffffff8162a77c>] print_circular_bug+0x1fb/0x20c [ 606.409243] [<ffffffff810b25e8>] __lock_acquire+0x1ac8/0x1b90 [ 606.441343] [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs] [ 606.474583] [<ffffffff810b2c82>] lock_acquire+0xa2/0x140 [ 606.505934] [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs] [ 606.539429] [<ffffffff8132babd>] ? do_raw_spin_unlock+0x5d/0xb0 [ 606.571719] [<ffffffff8117dac6>] __sb_start_write+0xc6/0x1b0 [ 606.603498] [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs] [ 606.637405] [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs] [ 606.670165] [<ffffffff81172e75>] ? kmem_cache_alloc+0xb5/0x160 [ 606.702144] [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs] [ 606.735562] [<ffffffffa00256a6>] ? block_rsv_add_bytes+0x56/0x80 [btrfs] [ 606.769861] [<ffffffffa00431a3>] btrfs_start_transaction+0x13/0x20 [btrfs] [ 606.804575] [<ffffffffa004ccfa>] btrfs_evict_inode+0x19a/0x330 [btrfs] [ 606.838756] [<ffffffff81634c6b>] ? _raw_spin_unlock+0x2b/0x40 [ 606.872010] [<ffffffff811972c8>] evict+0xb8/0x1c0 [ 606.903800] [<ffffffff811974d5>] iput+0x105/0x210 [ 606.935416] [<ffffffffa0048082>] btrfs_run_delayed_iputs+0xf2/0x130 [btrfs] [ 606.970510] [<ffffffffa003b5d5>] ? cleaner_kthread+0x95/0x120 [btrfs] [ 607.005648] [<ffffffffa003b5e1>] cleaner_kthread+0xa1/0x120 [btrfs] [ 607.040724] [<ffffffffa003b540>] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs] [ 607.104740] [<ffffffff810791ee>] kthread+0xae/0xc0 [ 607.137119] [<ffffffff810b379d>] ? trace_hardirqs_on+0xd/0x10 [ 607.169797] [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10 [ 607.202472] [<ffffffff81635430>] ? retint_restore_args+0x13/0x13 [ 607.235884] [<ffffffff81079140>] ? flush_kthread_work+0x1a0/0x1a0 [ 607.268731] [<ffffffff8163e740>] ? gs_change+0x13/0x13 Signed-off-by: Sage Weil <sage@inktank.com>	2012-10-01 15:19:08 -04:00
Sage Weil	e209db7ace	Btrfs: set journal_info in async trans commit worker We expect current->journal_info to point to the trans handle we are committing. Signed-off-by: Sage Weil <sage@inktank.com>	2012-10-01 15:19:08 -04:00
Sage Weil	6fc4e35485	Btrfs: pass lockdep rwsem metadata to async commit transaction The freeze rwsem is taken by sb_start_intwrite() and dropped during the commit_ or end_transaction(). In the async case, that happens in a worker thread. Tell lockdep the calling thread is releasing ownership of the rwsem and the async thread is picking it up. XFS plays the same trick in fs/xfs/xfs_aops.c. Signed-off-by: Sage Weil <sage@inktank.com>	2012-10-01 15:19:07 -04:00
Josef Bacik	2aaa665581	Btrfs: add hole punching This patch adds hole punching via fallocate. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:07 -04:00
Josef Bacik	2671485d39	Btrfs: remove unused hint byte argument for btrfs_drop_extents I audited all users of btrfs_drop_extents and found that nobody actually uses the hint_byte argument. I'm sure it was used for something at some point but it's not used now, and the way the pinning works the disk bytenr would never be immediately useful anyway so lets just remove it. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:06 -04:00
Liu Bo	d279440511	Btrfs: check if an inode has no checksum when logging it This is based on Josef's "Btrfs: turbo charge fsync". If an inode is a BTRFS_INODE_NODATASUM one, we don't need to look for csum items any more. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:06 -04:00
Liu Bo	46d8bc3424	Btrfs: fix a bug in checking whether a inode is already in log This is based on Josef's "Btrfs: turbo charge fsync". The current btrfs checks if an inode is in log by comparing root's last_log_commit to inode's last_sub_trans[2]. But the problem is that this root->last_log_commit is shared among inodes. Say we have N inodes to be logged, after the first inode, root's last_log_commit is updated and the N-1 remained files will be skipped. This fixes the bug by keeping a local copy of root's last_log_commit inside each inode and this local copy will be maintained itself. [1]: we regard each log transaction as a subset of btrfs's transaction, i.e. sub_trans Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:06 -04:00
Miao Xie	321f0e7022	Btrfs: fix wrong orphan count of the fs/file tree If we add a new orphan item, we should increase the atomic counter, not decrease it. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:05 -04:00
Liu Bo	4e2f84e63d	Btrfs: improve fsync by filtering extents that we want This is based on Josef's "Btrfs: turbo charge fsync". The above Josef's patch performs very good in random sync write test, because we won't have too much extents to merge. However, it does not performs good on the test: dd if=/dev/zero of=foobar bs=4k count=12500 oflag=sync The reason is when we do sequencial sync write, we need to merge the current extent just with the previous one, so that we can get accumulated extents to log: A(4k) --> AA(8k) --> AAA(12k) --> AAAA(16k) ... So we'll have to flush more and more checksum into log tree, which is the bottleneck according to my tests. But we can avoid this by telling fsync the real extents that are needed to be logged. With this, I did the above dd sync write test (size=50m), w/o (orig) w/ (josef's) w/ (this) SATA 104KB/s 109KB/s 121KB/s ramdisk 1.5MB/s 1.5MB/s 10.7MB/s (613%) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:05 -04:00
Josef Bacik	ca7e70f590	Btrfs: do not needlessly restart the transaction for enospc We will stop and restart a transaction every time we move to a different leaf when truncating a file. This is for enospc reasons, but really we could probably get away with doing this a little better by actually working until we hit an ENOSPC. So add a ->failfast flag to the block_rsv and set it when we do truncates which will fail as soon as the block rsv runs out of space, and then at that point we can stop and restart the transaction and refill the block rsv and carry on. This will make rm'ing of a file with lots of extents a bit faster. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:04 -04:00
Liu Bo	06d3d22b45	Btrfs: cleanup extents after we finish logging inode This is based on Josef's "Btrfs: turbo charge fsync". We should cleanup those extents after we've finished logging inode, otherwise we may do redundant work on them. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:04 -04:00
Josef Bacik	0fa83cdb1d	Btrfs: only warn if we hit an error when doing the tree logging I hit this a couple times while working on my fsync patch (all my bugs, not normal operation), but with my new stuff we could have new errors from cases I have not encountered, so instead of BUG()'ing we should be WARN()'ing so that we are notified there is a problem but the user doesn't lose their data. We can easily commit the transaction in the case that the tree logging fails and still be fine, so let's try and be as nice to the user as possible. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:03 -04:00
Josef Bacik	5dc562c541	Btrfs: turbo charge fsync At least for the vm workload. Currently on fsync we will 1) Truncate all items in the log tree for the given inode if they exist and 2) Copy all items for a given inode into the log The problem with this is that for things like VMs you can have lots of extents from the fragmented writing behavior, and worst yet you may have only modified a few extents, not the entire thing. This patch fixes this problem by tracking which transid modified our extent, and then when we do the tree logging we find all of the extents we've modified in our current transaction, sort them and commit them. We also only truncate up to the xattrs of the inode and copy that stuff in normally, and then just drop any extents in the range we have that exist in the log already. Here are some numbers of a 50 meg fio job that does random writes and fsync()s after every write Original Patched SATA drive 82KB/s 140KB/s Fusion drive 431KB/s 2532KB/s So around 2-6 times faster depending on your hardware. There are a few corner cases, for example if you truncate at all we have to do it the old way since there is no way to be sure what is in the log is ok. This probably could be done smarter, but if you write-fsync-truncate-write-fsync you deserve what you get. All this work is in RAM of course so if your inode gets evicted from cache and you read it in and fsync it we'll do it the slow way if we are still in the same transaction that we last modified the inode in. The biggest cool part of this is that it requires no changes to the recovery code, so if you fsync with this patch and crash and load an old kernel, it will run the recovery and be a-ok. I have tested this pretty thoroughly with an fsync tester and everything comes back fine, as well as xfstests. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:03 -04:00
Josef Bacik	224ecce517	Btrfs: fix possible corruption when fsyncing written prealloced extents While working on my fsync patch my fsync tester kept hitting mismatching md5sums when I would randomly write to a prealloc'ed region, syncfs() and then write to the prealloced region some more and then fsync() and then immediately reboot. This is because the tree logging code will skip writing csums for file extents who's generation is less than the current running transaction. When we mark extents as written we haven't been updating their generation so they were always being skipped. This wouldn't happen if you were to preallocate and then write in the same transaction, but if you for example prealloced a VM you could definitely run into this problem. This patch makes my fsync tester happy again. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:02 -04:00
Josef Bacik	54338b5cc4	Btrfs: do not allocate chunks as agressively Swinging this pendulum back the other way. We've been allocating chunks up to 2% of the disk no matter how much we actually have allocated. So instead fix this calculation to only allocate chunks if we have more than 80% of the space available allocated. Please test this as it will likely cause all sorts of ENOSPC problems to pop up suddenly. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:02 -04:00
Josef Bacik	7c735313bd	Btrfs: update last trans if we don't update the inode There is a completely impossible situation to hit where you can preallocate a file, fsync it, write into the preallocated region, have the transaction commit twice and then fsync and then immediately lose power and lose all of the contents of the write. This patch fixes this just so I feel better about the situation and because it is lightweight, we just update the last_trans when we finish an ordered IO and we don't update the inode itself. This way we are completely safe and I feel better. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:02 -04:00
Jan Schmidt	995e01b7af	Btrfs: fix gcc warnings for 32bit compiles Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-10-01 15:19:01 -04:00
Chris Mason	74dd17fbe3	Btrfs: fix btrfs send for inline items and compression The btrfs send code was assuming the offset of the file item into the extent translated to bytes on disk. If we're compressed, this isn't true, and so it was off into extents owned by other files. It was also improperly handling inline extents. This solves a crash where we may have gone past the end of the file extent item by not testing early enough for an inline extent. It also solves problems where we have a whole between the end of the inline item and the start of the full extent. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-10-01 15:19:00 -04:00
Alexander Block	6d85ed05e1	Btrfs: don't treat top/root directory inode as deleted/reused We can't do the deleted/reused logic for top/root inodes as it would create a stream that tries to delete and recreate the root dir. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:19:00 -04:00
Alexander Block	2981e225f7	Btrfs: ignore non-FS inodes for send/receive We have to ignore inode/space cache objects in send/receive. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:59 -04:00
Alexander Block	2f28f4787c	Btrfs: pass root instead of parent_root to iterate_inode_ref We need to pass the root that we determined earlier to iterate_inode_ref. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:58 -04:00
Alexander Block	d8347fa444	Btrfs: use <= instead of < in is_extent_unchanged Used the wrong compare operator here. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:58 -04:00
Alexander Block	3954096d4b	Btrfs: fix check for changed extent in is_extent_unchanged The previous check was working fine, but this check should be easier to read. Also, we could theoritically have some exotic bugs with the previous checks. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:57 -04:00
Alexander Block	5dc67d0ba9	Btrfs: free nce and nce_head on error in name_cache_insert Both were leaked in case of error. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:56 -04:00
Alexander Block	3e126f32f8	Btrfs: remove unused tmp_path from iterate_dir_item A leftover from older code and unused now. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:55 -04:00
Alexander Block	e938c8ad54	Btrfs: code cleanups for send/receive Doing some code cleanups as suggested by Arne. Changes do not change any logic. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:55 -04:00
Alexander Block	766702ef49	Btrfs: add/fix comments/documentation for send/receive As the subject already said, add/fix comments. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:54 -04:00
Alexander Block	e479d9bb5f	Btrfs: update send_progress at correct places Updating send_progress in process_recorded_refs was not correct. It got updated too early in the cur_inode_new_gen case. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Reported-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:53 -04:00
Alexander Block	34d73f54e2	Btrfs: make aux field of ulist 64 bit Btrfs send/receive uses the aux field to store inode numbers. On 32 bit machines this may become a problem. Also fix all users of ulist_add and ulist_add_merged. Reported-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:53 -04:00
Alexander Block	7e0926fe5f	Btrfs: fix use of radix_tree for name_cache in send/receive We can't easily use the index of the radix tree for inums as the radix tree uses 32bit indexes on 32bit kernels. For 32bit kernels, we now use the lower 32bit of the inum as index and an additional list to store multiple entries per radix tree entry. Reported-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:52 -04:00
Alexander Block	17589bd96e	Btrfs: fix memory leak for name_cache in send/receive When everything is done, name_cache_free is called which however forgot to call kfree on the cache entries. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:51 -04:00
Alexander Block	adbe7fb6c4	Btrfs: don't break in the final loop of find_extent_clone If we break, we may miss the clone from send_root which we prefer over all other clones. Commit is a result of Arne's review. Reported-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:50 -04:00
Alexander Block	52f9e53ede	Btrfs: use normal return path for root == send_root case Don't have a seperate return path for the mentioned case. Now we do the same "take lowest inode/offset" logic for all found clones. Commit is a result of Arne's review. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:50 -04:00
Alexander Block	35075bb046	Btrfs: use kmalloc instead of stack for backref_ctx Make sure to never get in trouble due to the backref_ctx which was on the stack before. Commit is a result of Arne's review. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:49 -04:00
Alexander Block	ee849c0472	Btrfs: rename backref_ctx::found_in_send_root to found_itself The new name should be easier to understand/read. Commit is a result of Arne's review. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:48 -04:00
Alexander Block	d27aed5e24	Btrfs: remove unused use_list from send/receive code use_list is a leftover and unused. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:48 -04:00
Alexander Block	ccf1626b49	Btrfs: add correct parent to check_dirs when dir got moved We only added the parent for the new position of a moved dir. We also need to add the old parent of the moved dir. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:47 -04:00
Alexander Block	9ea3ef516d	Btrfs: remove unused code with #if 0 fs_path_remove is not used at the moment due to a previous patch. Remove it for now (with #if 0) to avoid compile warnings. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:46 -04:00
Alexander Block	b9291affaa	Btrfs: add missing check for dir != tmp_dir to is_first_ref We missed that check which resultet in all refs with the same name being reported as first_ref. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:45 -04:00
Alexander Block	1f4692da95	Btrfs: fix cur_ino < parent_ino case for send/receive When the current inodes inum is smaller then the inum of the parent directory strange things were happending due to wrong path resolution and other bugs. Fix this with a new approach for the problem. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:45 -04:00
Alexander Block	85a7b33b96	Btrfs: add rdev to get_inode_info in send/receive We need rdev in the next commit. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:44 -04:00
Linus Torvalds	06d2fe153b	Driver core merge for 3.7-rc1 Here is the big driver core update for 3.7-rc1. A number of firmware_class.c updates (as you saw a month or so ago), and some hyper-v updates and some printk fixes as well. All patches that are outside of the drivers/base area have been acked by the respective maintainers, and have all been in the linux-next tree for a while. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iEYEABECAAYFAlBp3vkACgkQMUfUDdst+ylQoACgldktGFgkCLzH+rGYthrXOC5P 9hUAnjmOhdoHlMTL81vWTlH+BrGernym =khrr -----END PGP SIGNATURE----- Merge tag 'driver-core-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core merge from Greg Kroah-Hartman: "Here is the big driver core update for 3.7-rc1. A number of firmware_class.c updates (as you saw a month or so ago), and some hyper-v updates and some printk fixes as well. All patches that are outside of the drivers/base area have been acked by the respective maintainers, and have all been in the linux-next tree for a while. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>" * tag 'driver-core-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (95 commits) memory: tegra{20,30}-mc: Fix reading incorrect register in mc_readl() device.h: Add missing inline to #ifndef CONFIG_PRINTK dev_vprintk_emit memory: emif: Add ifdef CONFIG_DEBUG_FS guard for emif_debugfs_[init\|exit] Documentation: Fixes some translation error in Documentation/zh_CN/gpio.txt Documentation: Remove 3 byte redundant code at the head of the Documentation/zh_CN/arm/booting Documentation: Chinese translation of Documentation/video4linux/omap3isp.txt device and dynamic_debug: Use dev_vprintk_emit and dev_printk_emit dev: Add dev_vprintk_emit and dev_printk_emit netdev_printk/netif_printk: Remove a superfluous logging colon netdev_printk/dynamic_netdev_dbg: Directly call printk_emit dev_dbg/dynamic_debug: Update to use printk_emit, optimize stack driver-core: Shut up dev_dbg_reatelimited() without DEBUG tools/hv: Parse /etc/os-release tools/hv: Check for read/write errors tools/hv: Fix exit() error code tools/hv: Fix file handle leak Tools: hv: Implement the KVP verb - KVP_OP_GET_IP_INFO Tools: hv: Rename the function kvp_get_ip_address() Tools: hv: Implement the KVP verb - KVP_OP_SET_IP_INFO Tools: hv: Add an example script to configure an interface ...	2012-10-01 12:10:44 -07:00
Linus Torvalds	81f56e5375	Linux support for the 64-bit ARM architecture (AArch64) Features currently supported: - 39-bit address space for user and kernel (each) - 4KB and 64KB page configurations - Compat (32-bit) user applications (ARMv7, EABI only) - Flattened Device Tree (mandated for all AArch64 platforms) - ARM generic timers -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) iQIcBAABAgAGBQJQabRiAAoJEGvWsS0AyF7xXgcQAK+FTXt0ikdQYMkV5AIZXb9i xHRhuiZWx2vKyk0mCqpyGLY58GSmSb6uTBg/2P2Ej7vXdH/RB2goPzjlspfjkDL4 o8RJp7eQ07Uz3KRDYEJgMP8xKZid6KFG93RJ6TjjpKZLuDBdwiG1GP1vb0jVcWfo ttZrj/aI8lMcqrh3Vq5qefP7GWP1OVATqeaGTiT7oo38pXwF3t237xfBr2iDGFBp ZgIRddrxpa7JYUesfJDDDdGHvLq7Vh2jJV+io9qasBZDrtppGJIhZ0vUni2DgIi7 r4i1LcynDN4JaG0maZ4U/YQm74TCD4BqxV8GJ7zwLPTWeN+of+skjhPSLOkA+0fp I+sWjXlv200gDfJZ9qnUld2kFpoDfJi2b7fNDouSDd2OhmVOVWG3jnVP4Z7meVSb O8BYzWDdsAiabuwciUY3OsmW6424lT93b2v86Vncs4unKMvEjOPxYZbUxhqX8f2j gsmWwwD/yS4THx2B6OyW9VT3I5J6miqs2Glt/GG6vPWT5AKQJn9jCxKaBGhPMPIs xe5/GycBYjdk/Y8qRjegxFbEqzQuiRzmkeFn5jwjmBLqpGNbZDpvMaL6adhAKM5/ v6UIKa91ra4fC9N0h6G61pOc9N9DbT8wPbCbdYY0RMTMRuLDZDgAM3Bvz0r2APdD 96leNy6vx684hbkCSLJs =buJB -----END PGP SIGNATURE----- Merge tag 'arm64-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64 Pull arm64 support from Catalin Marinas: "Linux support for the 64-bit ARM architecture (AArch64) Features currently supported: - 39-bit address space for user and kernel (each) - 4KB and 64KB page configurations - Compat (32-bit) user applications (ARMv7, EABI only) - Flattened Device Tree (mandated for all AArch64 platforms) - ARM generic timers" * tag 'arm64-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64: (35 commits) arm64: ptrace: remove obsolete ptrace request numbers from user headers arm64: Do not set the SMP/nAMP processor bit arm64: MAINTAINERS update arm64: Build infrastructure arm64: Miscellaneous header files arm64: Generic timers support arm64: Loadable modules arm64: Miscellaneous library functions arm64: Performance counters support arm64: Add support for /proc/sys/debug/exception-trace arm64: Debugging support arm64: Floating point and SIMD arm64: 32-bit (compat) applications support arm64: User access library functions arm64: Signal handling support arm64: VDSO support arm64: System calls handling arm64: ELF definitions arm64: SMP support arm64: DMA mapping API ...	2012-10-01 11:51:57 -07:00
Steve French	1d4ab90776	[CIFS] Fix indentation of fs/cifs/Kconfig entries make menuconfig for cifs shows multiple entries toward the end of the list with the incorrect indentation (probably a bug in Kconfig parsing of items that are dependant on the module (cifs=m instead of just CONFIG_CIFS). This patch fixes the indentation of all but the last entry (CIFS_ACL) which I don't know how to fix. It also clarifies wording in two places Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2012-10-01 12:48:03 -05:00
Steve French	e4aa25e780	[CIFS] Fix SMB2 negotiation support to select only one dialect (based on vers=) Based on whether the user (on mount command) chooses: vers=3.0 (for smb3.0 support) vers=2.1 (for smb2.1 support) or (with subsequent patch, which will allow SMB2 support) vers=2.0 (for original smb2.02 dialect support) send only one dialect at a time during negotiate (we had been sending a list). Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2012-10-01 12:26:22 -05:00
Linus Torvalds	99dbb1632f	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull the trivial tree from Jiri Kosina: "Tiny usual fixes all over the place" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits) doc: fix old config name of kprobetrace fs/fs-writeback.c: cleanup riteback_sb_inodes kerneldoc btrfs: fix the commment for the action flags in delayed-ref.h btrfs: fix trivial typo for the comment of BTRFS_FREE_INO_OBJECTID vfs: fix kerneldoc for generic_fh_to_parent() treewide: fix comment/printk/variable typos ipr: fix small coding style issues doc: fix broken utf8 encoding nfs: comment fix platform/x86: fix asus_laptop.wled_type module parameter mfd: printk/comment fixes doc: getdelays.c: remember to close() socket on error in create_nl_socket() doc: aliasing-test: close fd on write error mmc: fix comment typos dma: fix comments spi: fix comment/printk typos in spi Coccinelle: fix typo in memdup_user.cocci tmiofb: missing NULL pointer checks tools: perf: Fix typo in tools/perf tools/testing: fix comment / output typos ...	2012-10-01 09:06:36 -07:00
Linus Torvalds	e151960a23	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw Pull GFS2 updates from Steven Whitehouse: "The major feature this time is the "rbm" conversion in the resource group code. The new struct gfs2_rbm specifies the location of an allocatable block in (resource group, bitmap, offset) form. There are a number of added helper functions, and later patches then rewrite some of the resource group code in terms of this new structure. Not only does this give us a nice code clean up, but it also removes some of the previous restrictions where extents could not cross bitmap boundaries, for example. In addition to that, there are a few bug fixes and clean ups, but the rbm work is by far the majority of this patch set in terms of number of changed lines." * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw: (27 commits) GFS2: Write out dirty inode metadata in delayed deletes GFS2: fix s_writers.counter imbalance in gfs2_ail_empty_gl GFS2: Fix infinite loop in rbm_find GFS2: Consolidate free block searching functions GFS2: Get rid of I_MUTEX_QUOTA usage GFS2: Stop block extents at the end of bitmaps GFS2: Fix unclaimed_blocks() wrapping bug and clean up GFS2: Improve block reservation tracing GFS2: Fall back to ignoring reservations, if there are no other blocks left GFS2: Fix ->show_options() for statfs slow GFS2: Use rbm for gfs2_setbit() GFS2: Use rbm for gfs2_testbit() GFS2: Eliminate unnecessary check for state > 3 in bitfit GFS2: Eliminate redundant calls to may_grant GFS2: Combine functions gfs2_glock_dq_wait and wait_on_demote GFS2: Combine functions gfs2_glock_wait and wait_on_holder GFS2: inline __gfs2_glock_schedule_for_reclaim GFS2: change function gfs2_direct_IO to use a normal gfs2_glock_dq GFS2: rbm code cleanup GFS2: Fix case where reservation finished at end of rgrp ...	2012-10-01 08:51:04 -07:00
Theodore Ts'o	041bbb6d36	ext4: fix mtime update in nodelalloc mode Commits `5e8830dc85` and `41c4d25f78` introduced a regression into v3.6-rc1 for ext4 in nodealloc mode, such that mtime updates would not take place for files modified via mmap if the page was already in the page cache. This would also affect ext3 file systems mounted using the ext4 file system driver. The problem was that ext4_page_mkwrite() had a shortcut which would avoid calling __block_page_mkwrite() under some circumstances, and the above two commit transferred the responsibility of calling file_update_time() to __block_page_mkwrite --- which woudln't get called in some circumstances. Since __block_page_mkwrite() only has three callers, block_page_mkwrite(), ext4_page_mkwrite, and nilfs_page_mkwrite(), the best way to solve this is to move the responsibility for calling file_update_time() to its caller. This problem was found via xfstests #215 with a file system mounted with -o nodelalloc. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: stable@vger.kernel.org	2012-09-30 23:04:56 -04:00
Dmitry Monakhov	6f2080e644	ext4: fix ext_remove_space for punch_hole case Inode is allowed to have empty leaf only if it this is blockless inode. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-30 23:03:50 -04:00
Dmitry Monakhov	02d262dffc	ext4: punch_hole should wait for DIO writers punch_hole is the place where we have to wait for all existing writers (writeback, aio, dio), but currently we simply flush pended end_io request which is not sufficient. Other issue is that punch_hole performed w/o i_mutex held which obviously result in dangerous data corruption due to write-after-free. This patch performs following changes: - Guard punch_hole with i_mutex - Recheck inode flags under i_mutex - Block all new dio readers in order to prevent information leak caused by read-after-free pattern. - punch_hole now wait for all writers in flight NOTE: XXX write-after-free race is still possible because new dirty pages may appear due to mmap(), and currently there is no easy way to stop writeback while punch_hole is in progress. [ Fixed error return from ext4_ext_punch_hole() to make sure that we release i_mutex before returning EPERM or ETXTBUSY -- Ted ] Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-30 23:03:42 -04:00
Al Viro	38b983b346	generic sys_execve() Selected by __ARCH_WANT_SYS_EXECVE in unistd.h. Requires * working current_pt_regs() * NOT doing a syscall-in-kernel kind of kernel_execve() implementation. Using generic kernel_execve() is fine. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-30 22:20:51 -04:00
Al Viro	282124d186	generic kernel_execve() based mostly on arm and alpha versions. Architectures can define __ARCH_WANT_KERNEL_EXECVE and use it, provided that * they have working current_pt_regs(), even for kernel threads. * kernel_thread-spawned threads do have space for pt_regs in the normal location. Normally that's as simple as switching to generic kernel_thread() and making sure that kernel threads do not go through return from syscall path; call the payload from equivalent of ret_from_fork if we are in a kernel thread (or just have separate ret_from_kernel_thread and make copy_thread() use it instead of ret_from_fork in kernel thread case). * they have ret_from_kernel_execve(); it is called after successful do_execve() done by kernel_execve() and gets normal pt_regs location passed to it as argument. It's essentially a longjmp() analog - it should set sp, etc. to the situation expected at the return for syscall and go there. Eventually the need for that sucker will disappear, but that'll take some surgery on kernel_thread() payloads. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-30 13:36:39 -04:00
Miklos Szeredi	8110e16d42	vfs: dcache: fix deadlock in tree traversal IBM reported a deadlock in select_parent(). This was found to be caused by taking rename_lock when already locked when restarting the tree traversal. There are two cases when the traversal needs to be restarted: 1) concurrent d_move(); this can only happen when not already locked, since taking rename_lock protects against concurrent d_move(). 2) racing with final d_put() on child just at the moment of ascending to parent; rename_lock doesn't protect against this rare race, so it can happen when already locked. Because of case 2, we need to be able to handle restarting the traversal when rename_lock is already held. This patch fixes all three callers of try_to_ascend(). IBM reported that the deadlock is gone with this patch. [ I rewrote the patch to be smaller and just do the "goto again" if the lock was already held, but credit goes to Miklos for the real work. - Linus ] Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-09-29 17:41:40 -07:00
Brian Norris	74d83beaa2	JFFS2: don't fail on bitflips in OOB JFFS2 was designed without thought for OOB bitflips, it seems, but they can occur and will be reported to JFFS2 via mtd_read_oob()[1]. We don't want to fail on these transactions, since the data was corrected. [1] Few drivers report bitflips for OOB-only transactions. With such drivers, this patch should have no effect. Signed-off-by: Brian Norris <computersforpeace@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2012-09-29 15:34:13 +01:00
Artem Bityutskiy	a445f784ae	JFFS2: fix unmount regression This patch fixes regression introduced by "8bdc81c jffs2: get rid of jffs2_sync_super". We submit a delayed work in order to make sure the write-buffer is synchronized at some point. But we do not flush it when we unmount, which causes an oops when we unmount the file-system and then the delayed work is executed. This patch fixes the issue by adding a "cancel_delayed_work_sync()" infocation in the '->sync_fs()' handler. This will make sure the delayed work is canceled on sync, unmount and re-mount. And because VFS always callse 'sync_fs()' before unmounting or remounting, this fixes the issue. Reported-by: Ludovic Desroches <ludovic.desroches@atmel.com> Cc: stable@vger.kernel.org [3.5+] Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Tested-by: Ludovic Desroches <ludovic.desroches@atmel.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2012-09-29 14:58:42 +01:00
Dmitry Monakhov	1f555cfa29	ext4: serialize truncate with owerwrite DIO workers Jan Kara have spotted interesting issue: There are potential data corruption issue with direct IO overwrites racing with truncate: Like: dio write truncate_task ->ext4_ext_direct_IO ->overwrite == 1 ->down_read(&EXT4_I(inode)->i_data_sem); ->mutex_unlock(&inode->i_mutex); ->ext4_setattr() ->inode_dio_wait() ->truncate_setsize() ->ext4_truncate() ->down_write(&EXT4_I(inode)->i_data_sem); ->__blockdev_direct_IO ->ext4_get_block ->submit_io() ->up_read(&EXT4_I(inode)->i_data_sem); # truncate data blocks, allocate them to # other inode - bad stuff happens because # dio is still in flight. In order to serialize with truncate dio worker should grab extra i_dio_count reference before drop i_mutex. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-29 00:58:26 -04:00
Dmitry Monakhov	1b65007e98	ext4: endless truncate due to nonlocked dio readers If we have enough aggressive DIO readers, truncate and other dio waiters will wait forever inside inode_dio_wait(). It is reasonable to disable nonlock DIO read optimization during truncate. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-29 00:56:15 -04:00
Dmitry Monakhov	1c9114f9c0	ext4: serialize unlocked dio reads with truncate Current serialization will works only for DIO which holds i_mutex, but nonlocked DIO following race is possible: dio_nolock_read_task truncate_task ->ext4_setattr() ->inode_dio_wait() ->ext4_ext_direct_IO ->ext4_ind_direct_IO ->__blockdev_direct_IO ->ext4_get_block ->truncate_setsize() ->ext4_truncate() #alloc truncated blocks #to other inode ->submit_io() #INFORMATION LEAK In order to serialize with unlocked DIO reads we have to rearrange wait sequence 1) update i_size first 2) if i_size about to be reduced wait for outstanding DIO requests 3) and only after that truncate inode blocks Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-29 00:55:23 -04:00
Dmitry Monakhov	17335dcc47	ext4: serialize dio nonlocked reads with defrag workers Inode's block defrag and ext4_change_inode_journal_flag() may affect nonlocked DIO reads result, so proper synchronization required. - Add missed inode_dio_wait() calls where appropriate - Check inode state under extra i_dio_count reference. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-29 00:41:21 -04:00
Dmitry Monakhov	28a535f9a0	ext4: completed_io locking cleanup Current unwritten extent conversion state-machine is very fuzzy. - For unknown reason it performs conversion under i_mutex. What for? My diagnosis: We already protect extent tree with i_data_sem, truncate and punch_hole should wait for DIO, so the only data we have to protect is end_io->flags modification, but only flush_completed_IO and end_io_work modified this flags and we can serialize them via i_completed_io_lock. Currently all these games with mutex_trylock result in the following deadlock truncate: kworker: ext4_setattr ext4_end_io_work mutex_lock(i_mutex) inode_dio_wait(inode) ->BLOCK DEADLOCK<- mutex_trylock() inode_dio_done() #TEST_CASE1_BEGIN MNT=/mnt_scrach unlink $MNT/file fallocate -l $((102410241024)) $MNT/file aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file sleep 2 truncate -s 0 $MNT/file #TEST_CASE1_END Or use 286's xfstests https://github.com/dmonakhov/xfstests/blob/devel/286 This patch makes state machine simple and clean: (1) xxx_end_io schedule final extent conversion simply by calling ext4_add_complete_io(), which append it to ei->i_completed_io_list NOTE1: because of (2A) work should be queued only if ->i_completed_io_list was empty, otherwise the work is scheduled already. (2) ext4_flush_completed_IO is responsible for handling all pending end_io from ei->i_completed_io_list Flushing sequence consists of following stages: A) LOCKED: Atomically drain completed_io_list to local_list B) Perform extents conversion C) LOCKED: move converted io's to to_free list for final deletion This logic depends on context which we was called from. D) Final end_io context destruction NOTE1: i_mutex is no longer required because end_io->flags modification is protected by ei->ext4_complete_io_lock Full list of changes: - Move all completion end_io related routines to page-io.c in order to improve logic locality - Move open coded logic from various xx_end_xx routines to ext4_add_complete_io() - remove EXT4_IO_END_FSYNC - Improve SMP scalability by removing useless i_mutex which does not protect io->flags anymore. - Reduce lock contention on i_completed_io_lock by optimizing list walk. - Rename ext4_end_io_nolock to end4_end_io and make it static - Check flush completion status to ext4_ext_punch_hole(). Because it is not good idea to punch blocks from corrupted inode. Changes since V3 (in request to Jan's comments): Fall back to active flush_completed_IO() approach in order to prevent performance issues with nolocked DIO reads. Changes since V2: Fix use-after-free caused by race truncate vs end_io_work Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2012-09-29 00:14:55 -04:00

... 2 3 4 5 6 ...

29083 Commits