OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Shaohua Li	849674e4fb	MD: rename some functions These short function names are hard to search. Rename them to make vim happy. Signed-off-by: Shaohua Li <shli@fb.com>	2016-01-20 13:52:20 -08:00
Linus Torvalds	3c28c9ccaf	md updates for 4.5 Mostly clustered-raid1 and raid5 journal updates. one Y2038 fix and other minor stuff. One patch removes me from the MAINTAINERS file and adds a record of my md maintainership to Credits. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJWmJEhAAoJEDnsnt1WYoG5raQQAI9lBrHO+Q8C8RImPsemLX0X ypjH38XUwEwKNYYfsCVI7PKAqCl7r8ITzY054gKsU0iHAfqLQlEN8aMz0v0fJQhg Msb7utrEMQ0UERwNcc+3J78ffFAdWrkHVd64Ley0h/pizFPSlL0K2RuIGTBc9sGX Hz2Ci11Ch7FdK7C/Zl7I6tK1pkthu3hBXYEZyg1GngRRhZEJj2U7mBmy1E37NA72 o66B5r5FSlnIA8MAo/EAViCxtMJKBPRWU/WnkMhOJ1Yyw/FwMpbM2prLBLtYFqwF OLZOLuDUHY5HxdX2U+3R0hBzF78aozcH6od60SWg7wOmI/IkXYiYFujlxMd132FE OT+aa+UHHDEkATTSyt98OmxIkQ8uqKiNsSYqBk9lpNAPtmEbhqRX4RAOdrqP0G83 DX7iyZpAK4YhB4BkJxMtNdSIOnss1TwfOdKyvoBZYmY6bTKh7p+dpw4cvIjV4VDi p6+BUQdJQ7mHRLV9QI4IuG52AJO8cRGc1OVvqLEMzO8uZlpyxX9nJrSqeP/dKKfa pJ5pYssilXEeKCDnODGqSRdt9aU4ENDW/oIkAW2U3cnSHUwBoMLF1WJ+M3Atbm+s i3/iDp26SnSiHM+DVHije5v0OGOroYdJwKDIFWToElcfc9Q5IDHU+KP8oeuPqqOS WA08l+zj+ahfP7Yu1DUC =gl+r -----END PGP SIGNATURE----- Merge tag 'md/4.5' of git://neil.brown.name/md Pull md updates from Neil Brown: "Mostly clustered-raid1 and raid5 journal updates. one Y2038 fix and other minor stuff. One patch removes me from the MAINTAINERS file and adds a record of my md maintainership to Credits" Many thanks to Neil, who has been around for a _looong_ time. * tag 'md/4.5' of git://neil.brown.name/md: (26 commits) md/raid: only permit hot-add of compatible integrity profiles Remove myself as MD Maintainer, and add to Credits. raid5-cache: handle journal hotadd in quiesce MD: add journal with array suspended md: set MD_HAS_JOURNAL in correct places md: Remove 'ready' field from mddev. md: remove unnecesary md_new_event_inintr raid5: allow r5l_io_unit allocations to fail raid5-cache: use a mempool for the metadata block raid5-cache: use a bio_set raid5-cache: add journal hot add/remove support drivers: md: use ktime_get_real_seconds() md: avoid warning for 32-bit sector_t raid5-cache: free meta_page earlier raid5-cache: simplify r5l_move_io_unit_list md: update comment for md_allow_write md-cluster: update comments for MD_CLUSTER_SEND_LOCKED_ALREADY md-cluster: Protect communication with mutexes md-cluster: Defer MD reloading to mddev->thread md-cluster: update the documentation ...	2016-01-15 12:28:00 -08:00
Linus Torvalds	7d1fc01afc	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: floppy: make local variable non-static exynos: fixes an incorrect header guard dt-bindings: fixes some incorrect header guards cpufreq-dt: correct dead link in documentation cpufreq: ARM big LITTLE: correct dead link in documentation treewide: Fix typos in printk Documentation: filesystem: Fix typo in fs/eventfd.c fs/super.c: use && instead of & for warn_on condition Documentation: fix sysfs-ptp lib: scatterlist: fix Kconfig description	2016-01-14 17:04:19 -08:00
Linus Torvalds	d080827f85	libnvdimm for 4.5 1/ Media error handling: The 'badblocks' implementation that originated in md-raid is up-levelled to a generic capability of a block device. This initial implementation is limited to being consulted in the pmem block-i/o path. Later, 'badblocks' will be consulted when creating dax mappings. 2/ Raw block device dax: For virtualization and other cases that want large contiguous mappings of persistent memory, add the capability to dax-mmap a block device directly. 3/ Increased /dev/mem restrictions: Add an option to treat all io-memory as IORESOURCE_EXCLUSIVE, i.e. disable /dev/mem access while a driver is actively using an address range. This behavior is controlled via the new CONFIG_IO_STRICT_DEVMEM option and can be overridden by the existing "iomem=relaxed" kernel command line option. 4/ Miscellaneous fixes include a 'pfn'-device huge page alignment fix, block device shutdown crash fix, and other small libnvdimm fixes. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJWlrhjAAoJEB7SkWpmfYgCFbAQALKsQfFwT6JFS+zlPgiNpbqw 2VMNKEH0AfGYGj96mT02j2q+vSUmXLMIDMTsbe0sDdtwFZtQbFmhmryzPWUVppSu KGTlLPW8vuEhQVs91+UI3BQKkvpi0+tbR8hPOh9W6QhjpRT+lyHFKnsNR5HZy5wB K4/VMaT5ffd5/pXRTjkYiPQYTwWyfcvNjICj0YtqhPvOwS031m77JpFsWJ8HSpEX K99VlzNUPMXd1pYkHmFNXWw52fhRGNhwAEomLeKMdQfKms+KnbKp8BOSA0aCqU8E kpujQcilDXJwykFQZOFI3Z5Dxvrv8lxFTU8HRMBvo3ESzfTWjfqcvyjGOjDUcruw ihESFSJtdZzhrBiMnf9RRqSpMFJvAT8MVT6Q4D3mZUHCMPbUqFJsQjMPt9hEH3ho 4F0D2lesOCkubUKFTZmjMoDb+szuKbVhYK8TeFVVEhizinc/Aj0NKuazJqi+CXB/ xh0ER4ZxD8wvzqFFWvS5UvR1G9I5fr7+3jGRUrqGLHlSdeXP9dkEg28ao3QbWk3x 1dPOen6ZqQ9WJ/E7eGmXbVEz2R4Xd79hMXQzdQwmKDk/KbxRoAp7hyU8BslAyrBf HCdmVt+RAgrxZYfFRXuLhqwEBThJnNrgZA3qu74FUpkpFg6xRUu1bAYBiF7N+bFi 82b5UbMkveBTtkXjJoiR =7V5r -----END PGP SIGNATURE----- Merge tag 'libnvdimm-for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "The bulk of this has appeared in -next and independently received a build success notification from the kbuild robot. The 'for-4.5/block- dax' topic branch was rebased over the weekend to drop the "block device end-of-life" rework that Al would like to see re-implemented with a notifier, and to address bug reports against the badblocks integration. There is pending feedback against "libnvdimm: Add a poison list and export badblocks" received last week. Linda identified some localized fixups that we will handle incrementally. Summary: - Media error handling: The 'badblocks' implementation that originated in md-raid is up-levelled to a generic capability of a block device. This initial implementation is limited to being consulted in the pmem block-i/o path. Later, 'badblocks' will be consulted when creating dax mappings. - Raw block device dax: For virtualization and other cases that want large contiguous mappings of persistent memory, add the capability to dax-mmap a block device directly. - Increased /dev/mem restrictions: Add an option to treat all io-memory as IORESOURCE_EXCLUSIVE, i.e. disable /dev/mem access while a driver is actively using an address range. This behavior is controlled via the new CONFIG_IO_STRICT_DEVMEM option and can be overridden by the existing "iomem=relaxed" kernel command line option. - Miscellaneous fixes include a 'pfn'-device huge page alignment fix, block device shutdown crash fix, and other small libnvdimm fixes" * tag 'libnvdimm-for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (32 commits) block: kill disk_{check\|set\|clear\|alloc}_badblocks libnvdimm, pmem: nvdimm_read_bytes() badblocks support pmem, dax: disable dax in the presence of bad blocks pmem: fail io-requests to known bad blocks libnvdimm: convert to statically allocated badblocks libnvdimm: don't fail init for full badblocks list block, badblocks: introduce devm_init_badblocks block: clarify badblocks lifetime badblocks: rename badblocks_free to badblocks_exit libnvdimm, pmem: move definition of nvdimm_namespace_add_poison to nd.h libnvdimm: Add a poison list and export badblocks nfit_test: Enable DSMs for all test NFITs md: convert to use the generic badblocks code block: Add badblock management for gendisks badblocks: Add core badblock management code block: fix del_gendisk() vs blkdev_ioctl crash block: enable dax for raw block devices block: introduce bdev_file_inode() restrict /dev/mem to idle io memory ranges arch: consolidate CONFIG_STRICT_DEVM in lib/Kconfig.debug ...	2016-01-13 19:15:14 -08:00
Dan Williams	1501efadc5	md/raid: only permit hot-add of compatible integrity profiles It is not safe for an integrity profile to be changed while i/o is in-flight in the queue. Prevent adding new disks or otherwise online spares to an array if the device has an incompatible integrity profile. The original change to the blk_integrity_unregister implementation in md, commmit `c7bfced9a6` "md: suspend i/o during runtime blk_integrity_unregister" introduced an immediate hang regression. This policy of disallowing changes the integrity profile once one has been established is shared with DM. Here is an abbreviated log from a test run that: 1/ Creates a degraded raid1 with an integrity-enabled device (pmem0s) [ 59.076127] 2/ Tries to add an integrity-disabled device (pmem1m) [ 90.489209] 3/ Retries with an integrity-enabled device (pmem1s) [ 205.671277] [ 59.076127] md/raid1:md0: active with 1 out of 2 mirrors [ 59.078302] md: data integrity enabled on md0 [..] [ 90.489209] md0: incompatible integrity profile for pmem1m [..] [ 205.671277] md: super_written gets error=-5 [ 205.677386] md/raid1:md0: Disk failure on pmem1m, disabling device. [ 205.677386] md/raid1:md0: Operation continuing on 1 devices. [ 205.683037] RAID1 conf printout: [ 205.684699] --- wd:1 rd:2 [ 205.685972] disk 0, wo:0, o:1, dev:pmem0s [ 205.687562] disk 1, wo:1, o:1, dev:pmem1s [ 205.691717] md: recovery of RAID array md0 Fixes: `c7bfced9a6` ("md: suspend i/o during runtime blk_integrity_unregister") Cc: <stable@vger.kernel.org> Cc: Mike Snitzer <snitzer@redhat.com> Reported-by: NeilBrown <neilb@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-14 11:49:57 +11:00
Shaohua Li	16a43f6a65	raid5-cache: handle journal hotadd in quiesce Handle journal hotadd in quiesce to avoid creating duplicated threads. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-14 11:49:43 +11:00
Shaohua Li	87d4d91616	MD: add journal with array suspended Hot add journal disk in recovery thread context brings a lot of trouble as IO could be running. Unlike spare disk hot add, adding journal disk with array suspended makes more sense and implmentation is much easier. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-14 11:49:43 +11:00
Shaohua Li	a62ab49eb5	md: set MD_HAS_JOURNAL in correct places Set MD_HAS_JOURNAL when a array is loaded or journal is initialized. This is to avoid the flags set too early in journal disk hotadd. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-14 11:49:43 +11:00
Linus Torvalds	33caf82acf	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "All kinds of stuff. That probably should've been 5 or 6 separate branches, but by the time I'd realized how large and mixed that bag had become it had been too close to -final to play with rebasing. Some fs/namei.c cleanups there, memdup_user_nul() introduction and switching open-coded instances, burying long-dead code, whack-a-mole of various kinds, several new helpers for ->llseek(), assorted cleanups and fixes from various people, etc. One piece probably deserves special mention - Neil's lookup_one_len_unlocked(). Similar to lookup_one_len(), but gets called without ->i_mutex and tries to avoid ever taking it. That, of course, means that it's not useful for any directory modifications, but things like getting inode attributes in nfds readdirplus are fine with that. I really should've asked for moratorium on lookup-related changes this cycle, but since I hadn't done that early enough... I am asking for that for the coming cycle, though - I'm going to try and get conversion of i_mutex to rwsem with ->lookup() done under lock taken shared. There will be a patch closer to the end of the window, along the lines of the one Linus had posted last May - mechanical conversion of ->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/ inode_is_locked()/inode_lock_nested(). To quote Linus back then: ----- \| This is an automated patch using \| \| sed 's/mutex_lock(&\(.\)->i_mutex)/inode_lock(\1)/' \| sed 's/mutex_unlock(&\(.\)->i_mutex)/inode_unlock(\1)/' \| sed 's/mutex_lock_nested(&\(.\)->i_mutex,[ ]I_MUTEX_\([A-Z0-9_]\))/inode_lock_nested(\1, I_MUTEX_\2)/' \| sed 's/mutex_is_locked(&\(.\)->i_mutex)/inode_is_locked(\1)/' \| sed 's/mutex_trylock(&\(.\)->i_mutex)/inode_trylock(\1)/' \| \| with a very few manual fixups ----- I'm going to send that once the ->i_mutex-affecting stuff in -next gets mostly merged (or when Linus says he's about to stop taking merges)" 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) nfsd: don't hold i_mutex over userspace upcalls fs:affs:Replace time_t with time64_t fs/9p: use fscache mutex rather than spinlock proc: add a reschedule point in proc_readfd_common() logfs: constify logfs_block_ops structures fcntl: allow to set O_DIRECT flag on pipe fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE fs: xattr: Use kvfree() [s390] page_to_phys() always returns a multiple of PAGE_SIZE nbd: use ->compat_ioctl() fs: use block_device name vsprintf helper lib/vsprintf: add %*pg format specifier fs: use gendisk->disk_name where possible poll: plug an unused argument to do_poll amdkfd: don't open-code memdup_user() cdrom: don't open-code memdup_user() rsxx: don't open-code memdup_user() mtip32xx: don't open-code memdup_user() [um] mconsole: don't open-code memdup_user_nul() [um] hostaudio: don't open-code memdup_user() ...	2016-01-12 17:11:47 -08:00
Linus Torvalds	03891f9c85	- The most significant set of changes this cycle is the Forward Error Correction (FEC) support that has been added to the DM verity target. Google uses DM verity on all Android devices and it is believed that this FEC support will enable DM verity to recover from storage failures seen since DM verity was first deployed as part of Android. - A stable fix for a race in the destruction of DM thin pool's workqueue - A stable fix for hung IO if a DM snapshot copy hit an error - A few small cleanups in DM core and DM persistent data. - A couple DM thinp range discard improvements (address atomicity of finding a range and the efficiency of discarding a partially mapped thin device) - Add ability to debug DM bufio leaks by recording stack trace when a buffer is allocated. Upon detected leak the recorded stack is dumped. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJWlAf0AAoJEMUj8QotnQNaHPIH/2rvnzg71RsP/7IRI/5DHETP ubxKhKd7tfqwJEjuQvhiYB1Ubo+gvXEuT51C2G1ug2QzHsjymmE14q/60ElB7+/U ++bGisWvqm4ZqWWM9yffqbESzNOfNTn7dLduaxGeLxVG3zVLfzQRfSPOqhk1FiIv H35v0Xx/j1NAHQtcocVYzG4P5BwfgmeyuYmUq8BklHNlwa3drBKnMZfIlF4u2216 Z3K7d+5nLpSsPyejzpQlByHTUt/eVy1Y2ZBgudWITaP5DAcUQwHyLZI4k3skmMiK O/xLZ54aeKI9NhtEwH8s8jOd3b7Kvw/oAw5nfPj7jmIDF3if8U2HCU6KgfBVwwU= =fOsS -----END PGP SIGNATURE----- Merge tag 'dm-4.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - The most significant set of changes this cycle is the Forward Error Correction (FEC) support that has been added to the DM verity target. Google uses DM verity on all Android devices and it is believed that this FEC support will enable DM verity to recover from storage failures seen since DM verity was first deployed as part of Android. - A stable fix for a race in the destruction of DM thin pool's workqueue - A stable fix for hung IO if a DM snapshot copy hit an error - A few small cleanups in DM core and DM persistent data. - A couple DM thinp range discard improvements (address atomicity of finding a range and the efficiency of discarding a partially mapped thin device) - Add ability to debug DM bufio leaks by recording stack trace when a buffer is allocated. Upon detected leak the recorded stack is dumped. * tag 'dm-4.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm snapshot: fix hung bios when copy error occurs dm thin: bump thin and thin-pool target versions dm thin: fix race condition when destroying thin pool workqueue dm space map metadata: remove unused variable in brb_pop() dm verity: add ignore_zero_blocks feature dm verity: add support for forward error correction dm verity: factor out verity_for_bv_block() dm verity: factor out structures and functions useful to separate object dm verity: move dm-verity.c to dm-verity-target.c dm verity: separate function for parsing opt args dm verity: clean up duplicate hashing code dm btree: factor out need_insert() helper dm bufio: use BUG_ON instead of conditional call to BUG dm bufio: store stacktrace in buffers to help find buffer leaks dm bufio: return NULL to improve code clarity dm block manager: cleanup code that prints stacktrace dm: don't save and restore bi_private dm thin metadata: make dm_thin_find_mapped_range() atomic dm thin metadata: speed up discard of partially mapped volumes	2016-01-11 22:25:00 -08:00
Dan Williams	d3b407fb3f	badblocks: rename badblocks_free to badblocks_exit For symmetry with badblocks_init() make it clear that this path only destroys incremental allocations of a badblocks instance, and does not free the badblocks instance itself. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-01-09 08:39:04 -08:00
Vishal Verma	fc974ee2bf	md: convert to use the generic badblocks code Retain badblocks as part of rdev, but use the accessor functions from include/linux/badblocks for all manipulation. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-01-09 08:39:03 -08:00
Mikulas Patocka	385277bfb5	dm snapshot: fix hung bios when copy error occurs When there is an error copying a chunk dm-snapshot can incorrectly hold associated bios indefinitely, resulting in hung IO. The function copy_callback sets pe->error if there was error copying the chunk, and then calls complete_exception. complete_exception calls pending_complete on error, otherwise it calls commit_exception with commit_callback (and commit_callback calls complete_exception). The persistent exception store (dm-snap-persistent.c) assumes that calls to prepare_exception and commit_exception are paired. persistent_prepare_exception increases ps->pending_count and persistent_commit_exception decreases it. If there is a copy error, persistent_prepare_exception is called but persistent_commit_exception is not. This results in the variable ps->pending_count never returning to zero and that causes some pending exceptions (and their associated bios) to be held forever. Fix this by unconditionally calling commit_exception regardless of whether the copy was successful. A new "valid" parameter is added to commit_exception -- when the copy fails this parameter is set to zero so that the chunk that failed to copy (and all following chunks) is not recorded in the snapshot store. Also, remove commit_callback now that it is merely a wrapper around pending_complete. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2016-01-08 20:03:05 -05:00
Mike Snitzer	1c2e54e1ed	dm thin: bump thin and thin-pool target versions Commit `3d5f6733` ("dm thin metadata: speed up discard of partially mapped volumes"), or some other dm-thinp change during the Linux 4.5 development window, really should've bumped these target versions. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-01-06 20:59:40 -05:00
NeilBrown	274d8cbde1	md: Remove 'ready' field from mddev. This field is always set in tandem with ->pers, and when it is tested ->pers is also tested. So ->ready is not needed. It was needed once, but code rearrangement and locking changes have removed that needed. Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-07 11:01:14 +11:00
Guoqing Jiang	bb9ef71646	md: remove unnecesary md_new_event_inintr md_new_event had removed sysfs_notify since 'commit `72a23c211e` ("Make sure all changes to md/sync_action are notified.")', so we can use md_new_event and delete md_new_event_inintr. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-07 11:01:14 +11:00
Christoph Hellwig	5036c39020	raid5: allow r5l_io_unit allocations to fail And propagate the error up the stack so we can add the stripe to no_stripes_list and retry our log operation later. This avoids blocking raid5d due to reclaim, an it allows to get rid of the deadlock-prone GFP_NOFAIL allocation. shli: add missing mempool_destroy() Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:40:12 +11:00
Christoph Hellwig	e8deb63810	raid5-cache: use a mempool for the metadata block We only have a limited number in flight, so use a page based mempool. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:40:08 +11:00
Christoph Hellwig	c38d29b33b	raid5-cache: use a bio_set This allows us to make guaranteed forward progress. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:40:04 +11:00
Shaohua Li	f6b6ec5cfa	raid5-cache: add journal hot add/remove support Add support for journal disk hot add/remove. Mostly trival checks in md part. The raid5 part is a little tricky. For hot-remove, we can't wait pending write as it's called from raid5d. The wait will cause deadlock. We simplily fail the hot-remove. A hot-remove retry can success eventually since if journal disk is faulty all pending write will be failed and finish. For hot-add, since an array supporting journal but without journal disk will be marked read-only, we are safe to hot add journal without stopping IO (should be read IO, while journal only handles write IO). Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:39:57 +11:00
Deepa Dinamani	9ebc6ef188	drivers: md: use ktime_get_real_seconds() get_seconds() API is not y2038 safe on 32 bit systems and the API is deprecated. Replace it with calls to ktime_get_real_seconds() API instead. Change mddev structure types to time64_t accordingly. 32 bit signed timestamps will overflow in the year 2038. Change the user interface mdu_array_info_s structure timestamps: ctime and utime values used in ioctls GET_ARRAY_INFO and SET_ARRAY_INFO to unsigned int. This will extend the field to last until the year 2106. The long term plan is to get rid of ctime and utime values in this structure as this information can be read from the on-disk meta data directly. Clamp the tim64_t timestamps to positive values with a max of U32_MAX when returning from GET_ARRAY_INFO ioctl to accommodate above changes in the data type of timestamps to unsigned int. v0.90 on disk meta data uses u32 for maintaining time stamps. So this will also last until year 2106. Assumption is that the usage of v0.90 will be deprecated by year 2106. Timestamp fields in the on disk meta data for v1.0 version already use 64 bit data types. Remove the truncation of the bits while writing to or reading from these from the disk. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:39:53 +11:00
Arnd Bergmann	3312c951ef	md: avoid warning for 32-bit sector_t When CONFIG_LBDAF is not set, sector_t is only 32-bits wide, which means we cannot have devices with more than 2TB, and the code that is trying to handle compatibility support for large devices in md version 0.90 is meaningless but also causes a compile-time warning: drivers/md/md.c: In function 'super_90_load': drivers/md/md.c:1029:19: warning: large integer implicitly truncated to unsigned type [-Woverflow] drivers/md/md.c: In function 'super_90_rdev_size_change': drivers/md/md.c:1323:17: warning: large integer implicitly truncated to unsigned type [-Woverflow] This adds a check for CONFIG_LBDAF to avoid even getting into this code path, and also adds an explicit cast to let the compiler know it doesn't have to warn about the truncation. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:39:48 +11:00
Christoph Hellwig	ad66d445ee	raid5-cache: free meta_page earlier Once the I/O completed we don't need the meta page anymore. As the iounits can live on for a long time this reduces memory pressure a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:39:43 +11:00
Christoph Hellwig	3848c0bcb0	raid5-cache: simplify r5l_move_io_unit_list It's only used for one kind of move, so make that explicit. Also clean up the code a bit by using list_for_each_safe. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:39:34 +11:00
Guoqing Jiang	abf3508d8f	md: update comment for md_allow_write MD_CHANGE_CLEAN had been replaced with MD_CHANGE_PENDING after commit 070dc6 ("md: resolve confusion of MD_CHANGE_CLEAN"), so make the change accordingly. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:39:26 +11:00
Guoqing Jiang	e19508fa4d	md-cluster: update comments for MD_CLUSTER_SEND_LOCKED_ALREADY 1. fix unbalanced parentheses. 2. add more description about that MD_CLUSTER_SEND_LOCKED_ALREADY will be cleared after set it in add_new_disk. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:39:21 +11:00
Guoqing Jiang	8b9277c814	md-cluster: Protect communication with mutexes Communication can happen through multiple threads. It is possible that one thread steps over another threads sequence. So, we use mutexes to protect both the send and receive sequences. Send communication is locked through state bit, MD_CLUSTER_SEND_LOCK. Communication is locked with bit manipulation in order to allow "lock and hold" for the add operation. In case of an add operation, if the lock is held, MD_CLUSTER_SEND_LOCKED_ALREADY is set. When md_update_sb() calls metadata_update_start(), it checks (in a single statement to avoid races), if the communication is already locked. If yes, it merely returns zero, else it locks the token lockresource. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:39:17 +11:00
Guoqing Jiang	15858fa5b0	md-cluster: Defer MD reloading to mddev->thread Reloading of superblock must be performed under reconfig_mutex. However, this cannot be done with md_reload_sb because it would deadlock with the message DLM lock. So, we defer it in md_check_recovery() which is executed by mddev->thread. This introduces a new flag, MD_RELOAD_SB, which if set, will reload the superblock. And good_device_nr is also added to 'struct mddev' which is used to get the num of the good device within cluster raid. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:39:10 +11:00
Guoqing Jiang	f6a2dc64ee	md-cluster: append some actions when change bitmap from clustered to none For clustered raid, we need to do extra actions when change bitmap to none. 1. check if all the bitmap lock could be get or not, if yes then we can continue the change since cluster raid is only active in current node. Otherwise return fail and unlock the related bitmap locks 2. set nodes to 0 and then leave cluster environment. 3. release other nodes's bitmap lock. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:38:57 +11:00
Goldwyn Rodrigues	09afd2a8d6	md-cluster: Allow spare devices to be marked as faulty If a spare device was marked faulty, it would not be reflected in receiving nodes because it would mark it as activated and continue. Continue the operation, so it may be set as faulty. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:38:51 +11:00
Goldwyn Rodrigues	54a88392cd	md-cluster: Fix the remove sequence with the new MD reload code The remove disk message does not need metadata_update_start(), but can be an independent message. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:38:42 +11:00
Guoqing Jiang	659b254fa7	md-cluster: remove a disk asynchronously from cluster environment For cluster raid, if one disk couldn't be reach in one node, then other nodes would receive the REMOVE message for the disk. In receiving node, we can't call md_kick_rdev_from_array to remove the disk from array synchronously since the disk might still be busy in this node. So let's set a ClusterRemove flag on the disk, then let the thread to do the removal job eventually. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:38:36 +11:00
Goldwyn Rodrigues	ac277c6a8a	md-cluster: Avoid the resync ping-pong If a RESYNCING message with (0,0) has been sent before, do not send it again. This avoids a resync ping pong between the nodes. We read the bitmap lockresource's LVB to figure out the previous value of the RESYNCING message. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:38:27 +11:00
Roman Gushchin	b46020aa3a	md/raid5: remove redundant check in stripe_add_to_batch_list() The stripe_add_to_batch_list() function is called only if stripe_can_batch() returned true, so there is no need for double check. Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Cc: Neil Brown <neilb@suse.com> Cc: linux-raid@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:38:22 +11:00
Al Viro	93bbf5831d	md: more open-coded offset_in_page() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-01-04 10:29:12 -05:00
Al Viro	756d097b95	dm-bufio: virt_to_phys() doesn't change remainder modulo PAGE_SIZE ... so virt_to_phys(p) & (PAGE_SIZE - 1) is a very odd way to spell offset_in_page(p). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-01-04 10:29:07 -05:00
NeilBrown	312045eef9	md: remove check for MD_RECOVERY_NEEDED in action_store. md currently doesn't allow a 'sync_action' such as 'reshape' to be set while MD_RECOVERY_NEEDED is set. This s a problem, particularly since commit `738a273806` as that can cause ->check_shape to call mddev_resume() which sets MD_RECOVERY_NEEDED. So by the time we come to start 'reshape' it is very likely that MD_RECOVERY_NEEDED is still set. Testing for this flag is not really needed and is in any case very racy as it can be set at any moment - asynchronously. Any race between setting a sync_action and setting MD_RECOVERY_NEEDED must already be handled properly in some locked code, probably md_check_recovery(), so remove the test here. The test on MD_RECOVERY_RUNNING is also racy in the 'reshape' case so we should test it again after getting mddev_lock(). As this fixes a race and a regression which can cause 'reshape' to fail, it is suitable for -stable kernels since 4.1 Reported-by: Xiao Ni <xni@redhat.com> Fixes: `738a273806` ("md/raid5: fix allocation of 'scribble' array.") Cc: stable@vger.kernel.org (v4.1+) Signed-off-by: NeilBrown <neilb@suse.com>	2015-12-21 11:10:06 +11:00
Goldwyn Rodrigues	cb01c5496d	Fix remove_and_add_spares removes drive added as spare in slot_store Commit `2910ff17d1` introduced a regression which would remove a recently added spare via slot_store. Revert part of the patch which touches slot_store() and add the disk directly using pers->hot_add_disk() Fixes: `2910ff17d1` ("md: remove_and_add_spares() to activate specific rdev") Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-12-18 15:19:16 +11:00
Mikulas Patocka	0dc10e50f2	md: fix bug due to nested suspend The patch `c7bfced9a6` committed to 4.4-rc causes crash in LVM test shell/lvchange-raid.sh. The kernel crashes with this BUG, the reason is that we attempt to suspend a device that is already suspended. See also https://bugzilla.redhat.com/show_bug.cgi?id=1283491 This patch fixes the bug by changing functions mddev_suspend and mddev_resume to always nest. The number of nested calls to mddev_nested_suspend is kept in the variable mddev->suspended. [neilb: made mddev_suspend() always nest instead of introduce mddev_nested_suspend] kernel BUG at drivers/md/md.c:317! CPU: 3 PID: 32754 Comm: lvm Not tainted 4.4.0-rc2 #1 task: 0000000047076040 ti: 0000000047014000 task.ti: 0000000047014000 YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI PSW: 00001000000001000000000000001111 Not tainted r00-03 000000000804000f 00000000102c5280 0000000010c7522c 000000007e3d1810 r04-07 0000000010c6f000 000000004ef37f20 000000007e3d1dd0 000000007e3d1810 r08-11 000000007c9f1600 0000000000000000 0000000000000001 ffffffffffffffff r12-15 0000000010c1d000 0000000000000041 00000000f98d63c8 00000000f98e49e4 r16-19 00000000f98e49e4 00000000c138fd06 00000000f98d63c8 0000000000000001 r20-23 0000000000000002 000000004ef37f00 00000000000000b0 00000000000001d1 r24-27 00000000424783a0 000000007e3d1dd0 000000007e3d1810 00000000102b2000 r28-31 0000000000000001 0000000047014840 0000000047014930 0000000000000001 sr00-03 0000000007040800 0000000000000000 0000000000000000 0000000007040800 sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000 IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000102c538c 00000000102c5390 IIR: 03ffe01f ISR: 0000000000000000 IOR: 00000000102b2748 CPU: 3 CR30: 0000000047014000 CR31: 0000000000000000 ORIG_R28: 00000000000000b0 IAOQ[0]: mddev_suspend+0x10c/0x160 [md_mod] IAOQ[1]: mddev_suspend+0x110/0x160 [md_mod] RP(r2): raid1_add_disk+0xd4/0x2c0 [raid1] Backtrace: [<0000000010c7522c>] raid1_add_disk+0xd4/0x2c0 [raid1] [<0000000010c20078>] raid_resume+0x390/0x418 [dm_raid] [<00000000105833e8>] dm_table_resume_targets+0xc0/0x188 [dm_mod] [<000000001057f784>] dm_resume+0x144/0x1e0 [dm_mod] [<0000000010587dd4>] dev_suspend+0x1e4/0x568 [dm_mod] [<0000000010589278>] ctl_ioctl+0x1e8/0x428 [dm_mod] [<0000000010589518>] dm_compat_ctl_ioctl+0x18/0x68 [dm_mod] [<0000000040377b88>] compat_SyS_ioctl+0xd0/0x1558 Fixes: `c7bfced9a6` ("md: suspend i/o during runtime blk_integrity_unregister") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-12-18 15:19:16 +11:00
Shaohua Li	9b15603dbd	MD: change journal disk role to disk 0 Neil pointed out setting journal disk role to raid_disks will confuse reshape if we support reshape eventually. Switching the role to 0 (we should be fine as long as the value >=0) and skip sysfs file creation to avoid error. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-12-18 15:19:16 +11:00
Artur Paszkiewicz	cc57858831	md/raid10: fix data corruption and crash during resync The commit `c31df25f20` ("md/raid10: make sync_request_write() call bio_copy_data()") replaced manual data copying with bio_copy_data() but it doesn't work as intended. The source bio (fbio) is already processed, so its bvec_iter has bi_size == 0 and bi_idx == bi_vcnt. Because of this, bio_copy_data() either does not copy anything, or worse, copies data from the ->bi_next bio if it is set. This causes wrong data to be written to drives during resync and sometimes lockups/crashes in bio_copy_data(): [ 517.338478] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [md126_raid10:3319] [ 517.347324] Modules linked in: raid10 xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul cryptd shpchp pcspkr ipmi_si ipmi_msghandler tpm_crb acpi_power_meter acpi_cpufreq ext4 mbcache jbd2 sr_mod cdrom sd_mod e1000e ax88179_178a usbnet mii ahci ata_generic crc32c_intel libahci ptp pata_acpi libata pps_core wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod [ 517.440555] CPU: 0 PID: 3319 Comm: md126_raid10 Not tainted 4.3.0-rc6+ #1 [ 517.448384] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYDCRB1.86B.0055.D14.1509221924 09/22/2015 [ 517.459768] task: ffff880153773980 ti: ffff880150df8000 task.ti: ffff880150df8000 [ 517.468529] RIP: 0010:[<ffffffff812e1888>] [<ffffffff812e1888>] bio_copy_data+0xc8/0x3c0 [ 517.478164] RSP: 0018:ffff880150dfbc98 EFLAGS: 00000246 [ 517.484341] RAX: ffff880169356688 RBX: 0000000000001000 RCX: 0000000000000000 [ 517.492558] RDX: 0000000000000000 RSI: ffffea0001ac2980 RDI: ffffea0000d835c0 [ 517.500773] RBP: ffff880150dfbd08 R08: 0000000000000001 R09: ffff880153773980 [ 517.508987] R10: ffff880169356600 R11: 0000000000001000 R12: 0000000000010000 [ 517.517199] R13: 000000000000e000 R14: 0000000000000000 R15: 0000000000001000 [ 517.525412] FS: 0000000000000000(0000) GS:ffff880174a00000(0000) knlGS:0000000000000000 [ 517.534844] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 517.541507] CR2: 00007f8a044d5fed CR3: 0000000169504000 CR4: 00000000001406f0 [ 517.549722] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 517.557929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 517.566144] Stack: [ 517.568626] ffff880174a16bc0 ffff880153773980 ffff880169356600 0000000000000000 [ 517.577659] 0000000000000001 0000000000000001 ffff880153773980 ffff88016a61a800 [ 517.586715] ffff880150dfbcf8 0000000000000001 ffff88016dd209e0 0000000000001000 [ 517.595773] Call Trace: [ 517.598747] [<ffffffffa043ef95>] raid10d+0xfc5/0x1690 [raid10] [ 517.605610] [<ffffffff816697ae>] ? __schedule+0x29e/0x8e2 [ 517.611987] [<ffffffff814ff206>] md_thread+0x106/0x140 [ 517.618072] [<ffffffff810c1d80>] ? wait_woken+0x80/0x80 [ 517.624252] [<ffffffff814ff100>] ? super_1_load+0x520/0x520 [ 517.630817] [<ffffffff8109ef89>] kthread+0xc9/0xe0 [ 517.636506] [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70 [ 517.643653] [<ffffffff8166d99f>] ret_from_fork+0x3f/0x70 [ 517.649929] [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70 Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Reviewed-by: Shaohua Li <shli@kernel.org> Cc: stable@vger.kernel.org (v4.2+) Fixes: `c31df25f20` ("md/raid10: make sync_request_write() call bio_copy_data()") Signed-off-by: NeilBrown <neilb@suse.com>	2015-12-18 15:19:16 +11:00
Nikolay Borisov	18d03e8c25	dm thin: fix race condition when destroying thin pool workqueue When a thin pool is being destroyed delayed work items are cancelled using cancel_delayed_work(), which doesn't guarantee that on return the delayed item isn't running. This can cause the work item to requeue itself on an already destroyed workqueue. Fix this by using cancel_delayed_work_sync() which guarantees that on return the work item is not running anymore. Fixes: `905e51b39a` ("dm thin: commit outstanding data every second") Fixes: `85ad643b7e` ("dm thin: add timeout to stop out-of-data-space mode holding IO forever") Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-12-17 15:47:20 -05:00
Mike Snitzer	512167788a	dm space map metadata: remove unused variable in brb_pop() Remove the unused struct block_op pointer that was inadvertantly introduced, via cut-and-paste of previous brb_op() code, as part of commit `50dd842ad`. (Cc'ing stable@ because commit `50dd842ad` did) Fixes: `50dd842ad` ("dm space map metadata: fix ref counting bug when bootstrapping a new space map") Reported-by: David Binderman <dcb314@hotmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-12-14 09:26:01 -05:00
Sami Tolvanen	0cc37c2df4	dm verity: add ignore_zero_blocks feature If ignore_zero_blocks is enabled dm-verity will return zeroes for blocks matching a zero hash without validating the content. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:39:03 -05:00
Sami Tolvanen	a739ff3f54	dm verity: add support for forward error correction Add support for correcting corrupted blocks using Reed-Solomon. This code uses RS(255, N) interleaved across data and hash blocks. Each error-correcting block covers N bytes evenly distributed across the combined total data, so that each byte is a maximum distance away from the others. This makes it possible to recover from several consecutive corrupted blocks with relatively small space overhead. In addition, using verity hashes to locate erasures nearly doubles the effectiveness of error correction. Being able to detect corrupted blocks also improves performance, because only corrupted blocks need to corrected. For a 2 GiB partition, RS(255, 253) (two parity bytes for each 253-byte block) can correct up to 16 MiB of consecutive corrupted blocks if erasures can be located, and 8 MiB if they cannot, with 16 MiB space overhead. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:39:03 -05:00
Sami Tolvanen	bb4d73ac5e	dm verity: factor out verity_for_bv_block() verity_for_bv_block() will be re-used by optional dm-verity object. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:39:02 -05:00
Sami Tolvanen	ffa393807c	dm verity: factor out structures and functions useful to separate object Prepare for an optional verity object to make use of existing dm-verity structures and functions. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:39:01 -05:00
Sami Tolvanen	03045cbafa	dm verity: move dm-verity.c to dm-verity-target.c Prepare for extending dm-verity with an optional object. Follows the naming convention used by other DM targets (e.g. dm-cache and dm-era). Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:39:01 -05:00
Sami Tolvanen	753c1fd028	dm verity: separate function for parsing opt args Move optional argument parsing into a separate function to make it easier to add more of them without making verity_ctr even longer. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:39:00 -05:00
Sami Tolvanen	6dbeda3469	dm verity: clean up duplicate hashing code Handle dm-verity salting in one place to simplify the code. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:39:00 -05:00

1 2 3 4 5 ...

4052 Commits