OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Jes Sorensen	e8ff8bf09f	md/raid1: Avoid raid1 resync getting stuck close_sync() needs to set conf->next_resync to a large, but safe value below MaxSector and use it to determine whether or not to set start_next_window in wait_barrier() Solution suggested by Neil Brown. Reported-by: Nate Dailey <nate.dailey@stratus.com> Tested-by: Xiao Ni <xni@redhat.com> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-02 17:23:44 +10:00
Julia Lawall	644df1a85f	md: drop null test before destroy functions Remove unneeded NULL test. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x; @@ -if (x != NULL) \(kmem_cache_destroy\\|mempool_destroy\\|dma_pool_destroy\)(x); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-02 17:23:44 +10:00
Shaohua Li	d4929add83	md: clear CHANGE_PENDING in readonly array If faulty disks of an array are more than allowed degraded number, the array enters error handling. It will be marked as read-only with MD_CHANGE_PENDING/RECOVERY_NEEDED set. But currently recovery doesn't clear CHANGE_PENDING bit for read-only array. If MD_CHANGE_PENDING is set for a raid5 array, all returned IO will be hold on a list till the bit is clear. But recovery nevery clears this bit, the IO is always in pending state and nevery finish. This has bad effects like upper layer can't get an IO error and the array can't be stopped. Fixes: `c3cce6cda1` ("md/raid5: ensure device failure recorded before write request returns.") Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-02 17:23:44 +10:00
NeilBrown	66eefe5de1	md/raid0: apply base queue limits before disk_stack_limits Calling e.g. blk_queue_max_hw_sectors() after calls to disk_stack_limits() discards the settings determined by disk_stack_limits(). So we need to make those calls first. Fixes: `199dc6ed51` ("md/raid0: update queue parameter in a safer location.") Cc: stable@vger.kernel.org (v2.6.35+ - please apply with `199dc6ed51`). Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-02 17:23:44 +10:00
NeilBrown	36707bb2e7	md/raid5: don't index beyond end of array in need_this_block(). When need_this_block probably shouldn't be called when there are more than 2 failed devices, we really don't want it to try indexing beyond the end of the failed_num[] of fdev[] arrays. So limit the loops to at most 2 iterations. Reported-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-10-02 17:23:43 +10:00
Shaohua Li	ebda780bce	raid5: update analysis state for failed stripe handle_failed_stripe() makes the stripe fail, eg, all IO will return with a failure, but it doesn't update stripe_head_state. Later handle_stripe() has special handling for raid6 for handle_stripe_fill(). That check before handle_stripe_fill() doesn't skip the failed stripe and we get a kernel crash in need_this_block. This patch clear the analysis state to make sure no functions wrongly called after handle_failed_stripe() Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-02 17:23:43 +10:00
NeilBrown	88724bfa68	md: wait for pending superblock updates before switching to read-only If a superblock update is pending, wait for it to complete before letting md_set_readonly() switch to readonly. Otherwise we might lose important information about a device having failed. For external arrays, waiting for superblock updates can wait on user-space, so in that case, just return an error. Reported-and-tested-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-02 17:23:43 +10:00
Junichi Nomura	2a708cff93	dm: fix AB-BA deadlock in __dm_destroy() __dm_destroy() takes io_barrier SRCU lock (dm_get_live_table) and suspend_lock in reverse order. Doing so can cause AB-BA deadlock: __dm_destroy dm_swap_table --------------------------------------------------- mutex_lock(suspend_lock) dm_get_live_table() srcu_read_lock(io_barrier) dm_sync_table() synchronize_srcu(io_barrier) .. waiting for dm_put_live_table() mutex_lock(suspend_lock) .. waiting for suspend_lock Fix this by taking the locks in proper order. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Fixes: `ab7c7bb6f4` ("dm: hold suspend_lock while suspending device during device deletion") Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-10-01 10:40:20 -04:00
Mike Snitzer	586b286b11	dm crypt: constrain crypt device's max_segment_size to PAGE_SIZE Setting the dm-crypt device's max_segment_size to PAGE_SIZE is an unfortunate constraint that is required to avoid the potential for exceeding dm-crypt's underlying device's max_segments limits -- due to crypt_alloc_buffer() possibly allocating pages for the encryption bio that are not as physically contiguous as the original bio. It is interesting to note that this problem was already fixed back in 2007 via commit `91e106259` ("dm crypt: use bio_add_page"). But Linux 4.0 commit `cf2f1abfb` ("dm crypt: don't allocate pages for a partial request") regressed dm-crypt back to _not_ using bio_add_page(). But given dm-crypt's cpu parallelization changes all depend on commit cf2f1abfb's abandoning of the more complex io fragments processing that dm-crypt previously had we cannot easily go back to using bio_add_page(). So all said the cleanest way to resolve this issue is to fix dm-crypt to properly constrain the original bios entering dm-crypt so the encryption bios that dm-crypt generates from the original bios are always compatible with the underlying device's max_segments queue limits. It should be noted that technically Linux 4.3 does _not_ need this fix because of the block core's new late bio-splitting capability. But, it is reasoned, there is little to be gained by having the block core split the encrypted bio that is composed of PAGE_SIZE segments. That said, in the future we may revert this change. Fixes: `cf2f1abfb` ("dm crypt: don't allocate pages for a partial request") Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=104421 Suggested-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.0+	2015-09-14 12:04:24 -04:00
Mike Snitzer	216076705d	dm thin: disable discard support for thin devices if pool's is disabled If the pool is configured with 'ignore_discard' its discard support is disabled. The pool's thin devices should also have queue_limits that reflect discards are disabled. Fixes: `34fbcf62` ("dm thin: range discard support") Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.1+	2015-09-13 21:32:10 -04:00
Linus Torvalds	8e78b7dc93	SCSI misc on 20150911 The major pieces of this patch are a set patches facilitating better integration between scsi and scsi_dh (the device handling layer used by multi-path; all the dm parts are acked by Mike Snitzer). It also includes driver updates for mp3sas, scsi_debug and an assortment of bug fixes. Signed-off-by: James Bottomley <JBottomley@Odin.com> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQEcBAABAgAGBQJV8yt5AAoJEDeqqVYsXL0MBsQH+wXvlx3o0BGuz5ZXfIs/RxzI MwGnu1J0LSA9FPakkMUVOBtsxIG+pCV+4eKorQMkfGCKAZ8daaYsyYvSEM2mcqIX 1Y/srEnbzfE94JHbsI2pbiMPkB7QdtW27WjTSjQGgD9igAyVmmITiQJrXbpAlSLF F6n++9avng+GhjXQ5TF8/y13OYgabIoAPM1j4B/ut/Ok8ReruBvMBnOla5w5RMKR rBZKTZfUwvX5S0cuREwj8tFsRVUgdBNSrcGswFJrZo5x9WAsSHLC6+SOLZuUy1vC ua0tNtEiyXiuR0/jSP9qv7hJ/j0BW+EGdnW6GZEzKpeMK5PxfVspOsbNunUDRsY= =Y9G1 -----END PGP SIGNATURE----- Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull second round of SCSI updates from James Bottomley: "There's one late arriving patch here (added today), fixing a build issue which the scsi_dh patch set in here uncovered. Other than that, everything has been incubated in -next and the checkers for a week. The major pieces of this patch are a set patches facilitating better integration between scsi and scsi_dh (the device handling layer used by multi-path; all the dm parts are acked by Mike Snitzer). This also includes driver updates for mp3sas, scsi_debug and an assortment of bug fixes" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (50 commits) scsi_dh: fix randconfig build error scsi: fix scsi_error_handler vs. scsi_host_dev_release race fcoe: Convert use of __constant_htons to htons mpt2sas: setpci reset kernel oops fix pm80xx: Don't override ts->stat on IO_OPEN_CNX_ERROR_HW_RESOURCE_BUSY lpfc: Fix possible use-after-free and double free in lpfc_mbx_cmpl_rdp_page_a2() bfa: Fix incorrect de-reference of pointer bfa: Fix indentation scsi_transport_sas: Remove check for SAS expander when querying bay/enclosure IDs. scsi_debug: resp_request: remove unused variable scsi_debug: fix REPORT LUNS Well Known LU scsi_debug: schedule_resp fix input variable check scsi_debug: make dump_sector static scsi_debug: vfree is null safe so drop the check scsi_debug: use SCSI_W_LUN_REPORT_LUNS instead of SAM2_WLUN_REPORT_LUNS; scsi_debug: define pr_fmt() for consistent logging mpt2sas: Refcount fw_events and fix unsafe list usage mpt2sas: Refcount sas_device objects and fix unsafe list usage scsi_dh: return SCSI_DH_NOTCONN in scsi_dh_activate() scsi_dh: don't allow to detach device handlers at runtime ...	2015-09-11 18:15:18 -07:00
Christoph Hellwig	294ab783ad	scsi_dh: fix randconfig build error It looks like the Kconfig check that was meant to fix this (commit `fe9233fb69` [SCSI] scsi_dh: fix kconfig related build errors) was actually reversed, but no-one noticed until the new set of patches which separated DM and SCSI_DH). Fixes: `fe9233fb69` Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: James Bottomley <JBottomley@Odin.com>	2015-09-11 11:54:37 -07:00
Linus Torvalds	2a013e37ce	md updates for 4.3 - An assortment of little fixes, several for minor races only likely to be hit during testing - further cluster-md-raid1 development, not ready for real use yet. - new RAID6 syndrome code for ARM NEON - fix a race where a write can return before failure of one device is properly recorded in metadata, so an immediate crash might result in that write being lost. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJV6rSXAAoJEDnsnt1WYoG5eJIQAJs62+kB3+p87/VEu4hiBgYv yyaCBlTDn3xxy3WFLtvSIc+cZOamvGe/u/+9/aTA5zq30VpS0fwZlLUxwyR3vB7H aXh5y0JL8fViCUp6o+SplOpNDMAv4ntcW5NMv7uWhPLxtQxF/IJu6YLsDRcFaJqL LFCpvKSPgXOQ88ZXHa54xgFgEy+aAh1lxaWQmeqCLtgVc6YhwIsazG00R/vow8Pb 91u3jFioWjBpovTJiRxQO+NGemfOnKrm2EWkR4jzo8taHOouBWOH0RZjh/67dh4p QX4GjMINhFvYSr1UMGXfPm+Fjp2PRgx1qKyR/XhPeXNuE2xZf7T4aCnmKA8DVUA4 vyEl/l0lAZClExNA+bgE/wCrMpvtb9E4NnklzIffDqsDY79m9JzLwznYqDQcXP7m 0zPlRmf8KQoSOVV960N2O6siwQMwvTyPecG0raAv9BKjwZ+7/M8HLOplZuuMsbzT BZ6+FnAIDtc0Id0wwJoARUkghG7Nr4IWi4Q8MtyYLgH9KLnYkomjf/I2B5sEooCF JFIXeg+XX/xKSFHV4TycYdAFMtMEJMJ/pEnbKJ/W7CyAmrHJv+0/U+/gOkA8Mg76 iqYVWqRJHP9ZyWpmaWaaOeGIgFoJqrjM65qFNRcOnzMd/aAi8W63oyM99Lxi+1pm i8StqQBNtiwzds/w32SI =n+/k -----END PGP SIGNATURE----- Merge tag 'md/4.3' of git://neil.brown.name/md Pull md updates from Neil Brown: - an assortment of little fixes, several for minor races only likely to be hit during testing - further cluster-md-raid1 development, not ready for real use yet. - new RAID6 syndrome code for ARM NEON - fix a race where a write can return before failure of one device is properly recorded in metadata, so an immediate crash might result in that write being lost. * tag 'md/4.3' of git://neil.brown.name/md: (33 commits) md/raid5: ensure device failure recorded before write request returns. md/raid5: use bio_list for the list of bios to return. md/raid10: ensure device failure recorded before write request returns. md/raid1: ensure device failure recorded before write request returns. md-cluster: remove inappropriate try_module_get from join() md: extend spinlock protection in register_md_cluster_operations md-cluster: Read the disk bitmap sb and check if it needs recovery md-cluster: only call complete(&cinfo->completion) when node join cluster md-cluster: add missed lockres_free md-cluster: remove the unused sb_lock md-cluster: init suspend_list and suspend_lock early in join md-cluster: add the error check if failed to get dlm lock md-cluster: init completion within lockres_init md-cluster: fix deadlock issue on message lock md-cluster: transfer the resync ownership to another node md-cluster: split recover_slot for future code reuse md-cluster: use %pU to print UUIDs md: setup safemode_timer before it's being used md/raid5: handle possible race as reshape completes. md: sync sync_completed has correct value as recovery finishes. ...	2015-09-05 17:52:22 -07:00
NeilBrown	e89c6fdf9e	Merge linux-block/for-4.3/core into md/for-linux There were a few conflicts that are fairly easy to resolve. Signed-off-by: NeilBrown <neilb@suse.com>	2015-09-05 11:08:32 +02:00
Linus Torvalds	1e1a4e8f43	Merge tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper update from Mike Snitzer: - a couple small cleanups in dm-cache, dm-verity, persistent-data's dm-btree, and DM core. - a 4.1-stable fix for dm-cache that fixes the leaking of deferred bio prison cells - a 4.2-stable fix that adds feature reporting for the dm-stats features added in 4.2 - improve DM-snapshot to not invalidate the on-disk snapshot if snapshot device write overflow occurs; but a write overflow triggered through the origin device will still invalidate the snapshot. - optimize DM-thinp's async discard submission a bit now that late bio splitting has been included in block core. - switch DM-cache's SMQ policy lock from using a mutex to a spinlock; improves performance on very low latency devices (eg. NVMe SSD). - document DM RAID 4/5/6's discard support [ I did not pull the slab changes, which weren't appropriate for this tree, and weren't obviously the right thing to do anyway. At the very least they need some discussion and explanation before getting merged. Because not pulling the actual tagged commit but doing a partial pull instead, this merge commit thus also obviously is missing the git signature from the original tag ] * tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm cache: fix use after freeing migrations dm cache: small cleanups related to deferred prison cell cleanup dm cache: fix leaking of deferred bio prison cells dm raid: document RAID 4/5/6 discard support dm stats: report precise_timestamps and histogram in @stats_list output dm thin: optimize async discard submission dm snapshot: don't invalidate on-disk image on snapshot write overflow dm: remove unlikely() before IS_ERR() dm: do not override error code returned from dm_get_device() dm: test return value for DM_MAPIO_SUBMITTED dm verity: remove unused mempool dm cache: move wake_waker() from free_migrations() to where it is needed dm btree remove: remove unused function get_nr_entries() dm btree: remove unused "dm_block_t root" parameter in btree_split_sibling() dm cache policy smq: change the mutex to a spinlock	2015-09-02 16:35:26 -07:00
Linus Torvalds	1081230b74	Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: "This first core part of the block IO changes contains: - Cleanup of the bio IO error signaling from Christoph. We used to rely on the uptodate bit and passing around of an error, now we store the error in the bio itself. - Improvement of the above from myself, by shrinking the bio size down again to fit in two cachelines on x86-64. - Revert of the max_hw_sectors cap removal from a revision again, from Jeff Moyer. This caused performance regressions in various tests. Reinstate the limit, bump it to a more reasonable size instead. - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me. Most devices have huge trim limits, which can cause nasty latencies when deleting files. Enable the admin to configure the size down. We will look into having a more sane default instead of UINT_MAX sectors. - Improvement of the SGP gaps logic from Keith Busch. - Enable the block core to handle arbitrarily sized bios, which enables a nice simplification of bio_add_page() (which is an IO hot path). From Kent. - Improvements to the partition io stats accounting, making it faster. From Ming Lei. - Also from Ming Lei, a basic fixup for overflow of the sysfs pending file in blk-mq, as well as a fix for a blk-mq timeout race condition. - Ming Lin has been carrying Kents above mentioned patches forward for a while, and testing them. Ming also did a few fixes around that. - Sasha Levin found and fixed a use-after-free problem introduced by the bio->bi_error changes from Christoph. - Small blk cgroup cleanup from Viresh Kumar" * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits) blk: Fix bio_io_vec index when checking bvec gaps block: Replace SG_GAPS with new queue limits mask block: bump BLK_DEF_MAX_SECTORS to 2560 Revert "block: remove artifical max_hw_sectors cap" blk-mq: fix race between timeout and freeing request blk-mq: fix buffer overflow when reading sysfs file of 'pending' Documentation: update notes in biovecs about arbitrarily sized bios block: remove bio_get_nr_vecs() fs: use helper bio_add_page() instead of open coding on bi_io_vec block: kill merge_bvec_fn() completely md/raid5: get rid of bio_fits_rdev() md/raid5: split bio for chunk_aligned_read block: remove split code in blkdev_issue_{discard,write_same} btrfs: remove bio splitting and merge_bvec_fn() calls bcache: remove driver private bio splitting code block: simplify bio_add_page() block: make generic_make_request handle arbitrarily sized bios blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL) block: don't access bio->bi_error after bio_put() block: shrink struct bio down to 2 cache lines again ...	2015-09-02 13:10:25 -07:00
Linus Torvalds	089b669506	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "The usual stuff from trivial tree for 4.3 (kerneldoc updates, printk() fixes, Documentation and MAINTAINERS updates)" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits) MAINTAINERS: update my e-mail address mod_devicetable: add space before */ scsi: a100u2w: trivial typo in printk i2c: Fix typo in i2c-bfin-twi.c treewide: fix typos in comment blocks Doc: fix trivial typo in SubmittingPatches proportions: Spelling s/consitent/consistent/ dm: Spelling s/consitent/consistent/ aic7xxx: Fix typo in error message pcmcia: Fix typo in locking documentation scsi/arcmsr: Fix typos in error log drm/nouveau/gr: Fix typo in nv10.c [SCSI] Fix printk typos in drivers/scsi staging: comedi: Grammar s/Enable support a/Enable support for a/ Btrfs: Spelling s/consitent/consistent/ README: GTK+ is a acronym ASoC: omap: Fix typo in config option description mm: tlb.c: Fix error message ntfs: super.c: Fix error log fix typo in Documentation/SubmittingPatches ...	2015-09-01 18:46:42 -07:00
Joe Thornber	cc7da0ba9c	dm cache: fix use after freeing migrations Both free_io_migration() and issue_discard() dereference a migration that was just freed. Fix those by saving off the migrations's cache object before freeing the migration. Also cleanup needless mg->cache dereferences now that the cache object is available directly. Fixes: `e44b6a5a3c` ("dm cache: move wake_waker() from free_migrations() to where it is needed") Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-09-01 08:56:14 -04:00
Mike Snitzer	dc9cee5db5	dm cache: small cleanups related to deferred prison cell cleanup Eliminate __cell_release() since it only had one caller that always released the cell holder. Switch cell_error_with_code() to using free_prison_cell() for the sake of consistency. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-08-31 15:50:28 -04:00
Joe Thornber	9153df7405	dm cache: fix leaking of deferred bio prison cells There were two cases where dm_cell_visit_release() was being called, which removes the cell from the prison's rbtree, but the callers didn't also return the cell to the mempool. Fix this by having them call free_prison_cell(). This leak manifested as the 'kmalloc-96' slab growing until OOM. Fixes: `651f5fa2a3` ("dm cache: defer whole cells") Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.1+	2015-08-31 15:08:14 -04:00
NeilBrown	c3cce6cda1	md/raid5: ensure device failure recorded before write request returns. When a write to one of the devices of a RAID5/6 fails, the failure is recorded in the metadata of the other devices so that after a restart the data on the failed drive wont be trusted even if that drive seems to be working again (maybe a cable was unplugged). Similarly when we record a bad-block in response to a write failure, we must not let the write complete until the bad-block update is safe. Currently there is no interlock between the write request completing and the metadata update. So it is possible that the write will complete, the app will confirm success in some way, and then the machine will crash before the metadata update completes. This is an extremely small hole for a racy to fit in, but it is theoretically possible and so should be closed. So: - set MD_CHANGE_PENDING when requesting a metadata update for a failed device, so we can know with certainty when it completes - queue requests that completed when MD_CHANGE_PENDING is set to only be processed after the metadata update completes - call raid_end_bio_io() on bios in that queue when the time comes. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:43:59 +02:00
NeilBrown	34a6f80e16	md/raid5: use bio_list for the list of bios to return. This will make it easier to splice two lists together which will be needed in future patch. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:43:50 +02:00
NeilBrown	95af587e95	md/raid10: ensure device failure recorded before write request returns. When a write to one of the legs of a RAID10 fails, the failure is recorded in the metadata of the other legs so that after a restart the data on the failed drive wont be trusted even if that drive seems to be working again (maybe a cable was unplugged). Currently there is no interlock between the write request completing and the metadata update. So it is possible that the write will complete, the app will confirm success in some way, and then the machine will crash before the metadata update completes. This is an extremely small hole for a racy to fit in, but it is theoretically possible and so should be closed. So: - set MD_CHANGE_PENDING when requesting a metadata update for a failed device, so we can know with certainty when it completes - queue requests that experienced an error on a new queue which is only processed after the metadata update completes - call raid_end_bio_io() on bios in that queue when the time comes. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:43:45 +02:00
NeilBrown	55ce74d4bf	md/raid1: ensure device failure recorded before write request returns. When a write to one of the legs of a RAID1 fails, the failure is recorded in the metadata of the other leg(s) so that after a restart the data on the failed drive wont be trusted even if that drive seems to be working again (maybe a cable was unplugged). Similarly when we record a bad-block in response to a write failure, we must not let the write complete until the bad-block update is safe. Currently there is no interlock between the write request completing and the metadata update. So it is possible that the write will complete, the app will confirm success in some way, and then the machine will crash before the metadata update completes. This is an extremely small hole for a racy to fit in, but it is theoretically possible and so should be closed. So: - set MD_CHANGE_PENDING when requesting a metadata update for a failed device, so we can know with certainty when it completes - queue requests that experienced an error on a new queue which is only processed after the metadata update completes - call raid_end_bio_io() on bios in that queue when the time comes. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:43:23 +02:00
NeilBrown	18b9f67962	md-cluster: remove inappropriate try_module_get from join() md_setup_cluster already calls try_module_get(), so this try_module_get isn't needed. Also, there is no matching module_put (except in error patch), so this leaves an unbalanced module count. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:43:17 +02:00
NeilBrown	6022e75bf0	md: extend spinlock protection in register_md_cluster_operations This code looks racy. The only possible race is if two modules try to register at the same time and that won't happen. But make the code look safe anyway. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:42:59 +02:00
Guoqing Jiang	abb9b22ac9	md-cluster: Read the disk bitmap sb and check if it needs recovery In gather_all_resync_info, we need to read the disk bitmap sb and check if it needs recovery. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:42:41 +02:00
Guoqing Jiang	eece075cda	md-cluster: only call complete(&cinfo->completion) when node join cluster Introduce MD_CLUSTER_BEGIN_JOIN_CLUSTER flag to make sure complete(&cinfo->completion) is only be invoked when node join cluster. Otherwise node failure could also call the complete, and it doesn't make sense to do it. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:42:31 +02:00
Guoqing Jiang	6e6d9f2cda	md-cluster: add missed lockres_free We also need to free the lock resource before goto out. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:42:23 +02:00
Guoqing Jiang	b2b9bfff0a	md-cluster: remove the unused sb_lock The sb_lock is not used anywhere, so let's remove it. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:42:14 +02:00
Guoqing Jiang	9e3072e373	md-cluster: init suspend_list and suspend_lock early in join If the node just join the cluster, and receive the msg from other nodes before init suspend_list, it will cause kernel crash due to NULL pointer dereference, so move the initializations early to fix the bug. md-cluster: Joined cluster 3578507b-e0cb-6d4f-6322-696cd7b1b10c slot 3 BUG: unable to handle kernel NULL pointer dereference at (null) ... ... ... Call Trace: [<ffffffffa0444924>] process_recvd_msg+0x2e4/0x330 [md_cluster] [<ffffffffa0444a06>] recv_daemon+0x96/0x170 [md_cluster] [<ffffffffa045189d>] md_thread+0x11d/0x170 [md_mod] [<ffffffff810768c4>] kthread+0xb4/0xc0 [<ffffffff8151927c>] ret_from_fork+0x7c/0xb0 ... ... ... RIP [<ffffffffa0443581>] __remove_suspend_info+0x11/0xa0 [md_cluster] Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:42:05 +02:00
Guoqing Jiang	b5ef56789b	md-cluster: add the error check if failed to get dlm lock In complicated cluster environment, it is possible that the dlm lock couldn't be get/convert on purpose, the related err info is added for better debug potential issue. For lockres_free, if the lock is blocking by a lock request or conversion request, then dlm_unlock just put it back to grant queue, so need to ensure the lock is free finally. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:41:56 +02:00
Guoqing Jiang	b83d51c078	md-cluster: init completion within lockres_init We should init completion within lockres_init, otherwise completion could be initialized more than one time during it's life cycle. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:41:50 +02:00
Guoqing Jiang	66099bb0ee	md-cluster: fix deadlock issue on message lock There is problem with previous communication mechanism, and we got below deadlock scenario with cluster which has 3 nodes. Sender Receiver Receiver token(EX) message(EX) writes message downconverts message(CR) requests ack(EX) get message(CR) gets message(CR) reads message reads message requests EX on message requests EX on message To fix this problem, we do the following changes: 1. the sender downconverts MESSAGE to CW rather than CR. 2. and the receiver request PR lock not EX lock on message. And in case we failed to down-convert EX to CW on message, it is better to unlock message otherthan still hold the lock. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Lidong Zhong <ldzhong@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:41:41 +02:00
Guoqing Jiang	dc737d7c3d	md-cluster: transfer the resync ownership to another node When node A stops an array while the array is doing a resync, we need to let another node B take over the resync task. To achieve the goal, we need the A send an explicit BITMAP_NEEDS_SYNC message to the cluster. And the node B which received that message will invoke __recover_slot to do resync. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:41:12 +02:00
Guoqing Jiang	05cd0e5176	md-cluster: split recover_slot for future code reuse Make recover_slot as a wraper to __recover_slot, since the logic of __recover_slot can be reused for the condition when other nodes need to take over the resync job. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:40:41 +02:00
Guoqing Jiang	b89f704a8d	md-cluster: use %pU to print UUIDs Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:40:30 +02:00
Sasha Levin	25b2edfa3b	md: setup safemode_timer before it's being used We used to set up the safemode_timer timer in md_run. If md_run would fail before the timer was set up we'd end up trying to modify a timer that doesn't have a callback function when we access safe_delay_store, which would trigger a BUG. neilb: delete init_timer() call as setup_timer() does that. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:39:39 +02:00
NeilBrown	6cbd81487f	md/raid5: handle possible race as reshape completes. It is possible (though unlikely) for a reshape to be interrupted between the time that end_reshape is called and the time when raid5_finish_reshape is called. This can leave conf->reshape_progress set to MaxSector, but mddev->reshape_position not. This combination confused reshape_request() when ->reshape_backwards. As conf->reshape_progress is so high, it seems the reshape hasn't really begun. But assuming MaxSector is a valid address only leads to sorrow. So ensure reshape_position and reshape_progress both agree, and add an extra check in reshape_request() just in case they don't. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:38:59 +02:00
NeilBrown	5ed1df2eac	md: sync sync_completed has correct value as recovery finishes. There can be a small window between the moment that recovery actually writes the last block and the time when various sysfs and /proc/mdstat attributes report that it has finished. During this time, 'sync_completed' can have the wrong value. This can confuse monitoring software. So: - don't set curr_resync_completed beyond the end of the devices, - set it correctly when resync/recovery has completed. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:38:17 +02:00
NeilBrown	c5e19d906a	md: be careful when testing resync_max against curr_resync_completed. While it generally shouldn't happen, it is not impossible for curr_resync_completed to exceed resync_max. This can particularly happen when reshaping RAID5 - the current status isn't copied to curr_resync_completed promptly, so when it is, it can exceed resync_max. This happens when the reshape is 'frozen', resync_max is set low, and reshape is re-enabled. Taking a difference between two unsigned numbers is always dangerous anyway, so add a test to behave correctly if curr_resync_completed > resync_max Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:37:33 +02:00
NeilBrown	a4a3d26d87	md: set MD_RECOVERY_RECOVER when starting a degraded array. This ensures that 'sync_action' will show 'recover' immediately the array is started. If there is no spare the status will change to 'idle' once that is detected. Clear MD_RECOVERY_RECOVER for a read-only array to ensure this change happens. This allows scripts which monitor status not to get confused - particularly my test scripts. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:37:03 +02:00
NeilBrown	c74c0d760e	md/raid5: remove incorrect "min_t()" when calculating writepos. This code is calculating: writepos, which is the furthest along address (device-space) that we will be writing to readpos, which is the earliest address that we could possible read from, and safepos, which is the earliest address in the 'old' section that we might read from after a crash when the reshape position is recovered from metadata. The first is a precise calculation, so clipping at zero doesn't make sense. As the reshape position is now guaranteed to always be a multiple of reshape_sectors and as we already BUG_ON when reshape_progress is zero, there is no point in this min_t() call. The readpos and safepos are worst case - actual value depends on precise geometry. That worst case could be negative, which is only a problem because we are storing the value in an unsigned. So leave the min_t() for those. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:36:06 +02:00
NeilBrown	05256d9884	md/raid5: strengthen check on reshape_position at run. When reshaping, we work in units of the largest chunk size. If changing from a larger to a smaller chunk size, that means we reshape more than one stripe at a time. So the required alignment of reshape_position needs to take into account both the old and new chunk size. This means that both 'here_new' and 'here_old' are calculated with respect to the same (maximum) chunk size, so testing if they are the same when delta_disks is zero becomes pointless. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:34:21 +02:00
NeilBrown	3cb5edf454	md/raid5: switch to use conf->chunk_sectors in place of mddev->chunk_sectors where possible The chunk_sectors and new_chunk_sectors fields of mddev can be changed any time (via sysfs) that the reconfig mutex can be taken. So raid5 keeps internal copies in 'conf' which are stable except for a short locked moment when reshape stops/starts. So any access that does not hold reconfig_mutex should use the 'conf' values, not the 'mddev' values. Several don't. This could result in corruption if new values were written at awkward times. Also use min() or max() rather than open-coding. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:32:48 +02:00
NeilBrown	5cac6bcb93	md/raid5: always set conf->prev_chunk_sectors and ->prev_algo These aren't really needed when no reshape is happening, but it is safer to have them always set to a meaningful value. The next patch will use ->prev_chunk_sectors without checking if a reshape is happening (because that makes the code simpler), and this patch makes that safe. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:32:25 +02:00
NeilBrown	02ec50265b	md/raid10: fix a few typos in comments Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:32:09 +02:00
NeilBrown	92140480ed	md/raid5: consider updating reshape_position at start of reshape. md/raid5 only updates ->reshape_position (which is stored in metadata and is authoritative) occasionally, but particularly when getting closed to ->resync_max as it must be correct when ->resync_max is reached. When mdadm tries to stop an array which is reshaping it will: - freeze the reshape, - set resync_max to where the reshape has reached. - unfreeze the reshape. When this happens, the reshape is aborted and then restarted. The restart doesn't check that resync_max is close, and so doesn't update ->reshape_position like it should. This results in the reshape stopping, but ->reshape_position being incorrect. So on that first call to reshape_request, make sure ->reshape_position is updated if needed. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:31:20 +02:00
NeilBrown	985ca973b6	md: close some races between setting and checking sync_action. When checking sync_action in a script, we want to be sure it is as accurate as possible. As resync/reshape etc doesn't always start immediately (a separate thread is scheduled to do it), it is best if 'action_show' checks if MD_RECOVER_NEEDED is set (which it does) and in that case reports what is likely to start soon (which it only sometimes does). So: - report 'reshape' if reshape_position suggests one might start. - set MD_RECOVERY_RECOVER in raid1_reshape(), because that is very likely to happen next. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:30:40 +02:00
NeilBrown	f7851be736	md: Keep /proc/mdstat reporting recovery until fully DONE. Currently when a recovery completes, mdstat shows that it has finished before the new device is marked as a full member. Because of this it can appear to a script that the recovery finished but the array isn't in sync. So while MD_RECOVERY_DONE is still set, keep mdstat reporting "recovery". Once md_reap_sync_thread() completes, the spare will be active and then MD_RECOVERY_DONE will be cleared. To ensure this is race-free, set MD_RECOVERY_DONE before clearning curr_resync. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:29:09 +02:00
Linus Torvalds	1c00038c76	Char/Misc driver patches for 4.3-rc1 Here's the "big" char/misc driver update for 4.3-rc1. Not much really interesting here, just a number of little changes all over the place, and some nice consolidation of the nvmem drivers to a common framework. As usual, the mei drivers stand out as the largest "churn" to handle new devices and features in their hardware. All have been in linux-next for a while with no issues. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iEYEABECAAYFAlXV844ACgkQMUfUDdst+ymYfQCgmDKjq3fsVHCxNZPxnukFYzvb xZkAnRb8fuub5gVQFP29A+rhyiuWD13v =Bq9K -----END PGP SIGNATURE----- Merge tag 'char-misc-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver patches from Greg KH: "Here's the "big" char/misc driver update for 4.3-rc1. Not much really interesting here, just a number of little changes all over the place, and some nice consolidation of the nvmem drivers to a common framework. As usual, the mei drivers stand out as the largest "churn" to handle new devices and features in their hardware. All have been in linux-next for a while with no issues" * tag 'char-misc-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (136 commits) auxdisplay: ks0108: initialize local parport variable extcon: palmas: Fix build break due to devm_gpiod_get_optional API change extcon: palmas: Support GPIO based USB ID detection extcon: Fix signedness bugs about break error handling extcon: Drop owner assignment from i2c_driver extcon: arizona: Simplify pdata symantics for micd_dbtime extcon: arizona: Declare 3-pole jack if we detect open circuit on mic extcon: Add exception handling to prevent the NULL pointer access extcon: arizona: Ensure variables are set for headphone detection extcon: arizona: Use gpiod inteface to handle micd_pol_gpio gpio extcon: arizona: Add basic microphone detection DT/ACPI bindings extcon: arizona: Update to use the new device properties API extcon: palmas: Remove the mutually_exclusive array extcon: Remove optional print_state() function pointer of struct extcon_dev extcon: Remove duplicate header file in extcon.h extcon: max77843: Clear IRQ bits state before request IRQ toshiba laptop: replace ioremap_cache with ioremap misc: eeprom: max6875: clean up max6875_read() misc: eeprom: clean up eeprom_read() misc: eeprom: 93xx46: clean up eeprom_93xx46_bin_read/write ...	2015-08-31 08:34:13 -07:00
Christoph Hellwig	566079c849	dm-mpath, scsi_dh: request scsi_dh modules in scsi_dh, not dm-mpath This way we can reused the same code any attachment method, not just those requested from dm-mpath. [jejb: fixup checkpatch error] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: James Bottomley <JBottomley@Odin.com>	2015-08-28 13:14:55 -07:00
Christoph Hellwig	1bab0de027	dm-mpath, scsi_dh: don't let dm detach device handlers While allowing dm-mpath to attach device handlers is a functionality we need for backwards compatibility reason there is no reason to reference count them and detach them if dm-mpath stops using the device for some reason. If the device handler works for the given device it can just stay attached, and we can take the retain_hw_handler codepath. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Hannes Reinecke <hare@Suse.de> Signed-off-by: James Bottomley <JBottomley@Odin.com>	2015-08-28 13:14:54 -07:00
Keith Busch	03100aada9	block: Replace SG_GAPS with new queue limits mask The SG_GAPS queue flag caused checks for bio vector alignment against PAGE_SIZE, but the device may have different constraints. This patch adds a queue limits so a driver with such constraints can set to allow requests that would have been unnecessarily split. The new gaps check takes the request_queue as a parameter to simplify the logic around invoking this function. This new limit makes the queue flag redundant, so removing it and all usage. Device-mappers will inherit the correct settings through blk_stack_limits(). Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-08-19 14:26:02 -07:00
Mikulas Patocka	bd49784fd1	dm stats: report precise_timestamps and histogram in @stats_list output If the user selected the precise_timestamps or histogram options, report it in the @stats_list message output. If the user didn't select these options, no extra tokens are reported, thus it is backward compatible with old software that doesn't know about precise timestamps and histogram. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.2	2015-08-18 17:20:03 -04:00
Mike Snitzer	84f8bd86cc	dm thin: optimize async discard submission __blkdev_issue_discard_async() doesn't need to worry about further splitting because the upper layer blkdev_issue_discard() will have already handled splitting bios such that the bi_size isn't overflowed. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2015-08-18 11:36:11 -04:00
Kent Overstreet	b54ffb73ca	block: remove bio_get_nr_vecs() We can always fill up the bio now, no need to estimate the possible size based on queue parameters. Acked-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [hch: rebased and wrote a changelog] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-08-13 12:32:04 -06:00
Kent Overstreet	8ae126660f	block: kill merge_bvec_fn() completely As generic_make_request() is now able to handle arbitrarily sized bios, it's no longer necessary for each individual block driver to define its own ->merge_bvec_fn() callback. Remove every invocation completely. Cc: Jens Axboe <axboe@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: drbd-user@lists.linbit.com Cc: Jiri Kosina <jkosina@suse.cz> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@kernel.org> Cc: ceph-devel@vger.kernel.org Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits) Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: also remove ->merge_bvec_fn() in dm-thin as well as dm-era-target, and resolve merge conflicts] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-08-13 12:31:57 -06:00
Kent Overstreet	7140aafce2	md/raid5: get rid of bio_fits_rdev() Remove bio_fits_rdev() as sufficient merge_bvec_fn() handling is now performed by blk_queue_split() in md_make_request(). Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: add more description in commit message] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-08-13 12:31:54 -06:00
Ming Lin	7ef6b12a19	md/raid5: split bio for chunk_aligned_read If a read request fits entirely in a chunk, it will be passed directly to the underlying device (providing it hasn't failed of course). If it doesn't fit, the slightly less efficient path that uses the stripe_cache is used. Requests that get to the stripe cache are always completely split up as necessary. So with RAID5, ripping out the merge_bvec_fn doesn't cause it to stop work, but could cause it to take the less efficient path more often. All that is needed to manage this is for 'chunk_aligned_read' do some bio splitting, much like the RAID0 code does. Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-08-13 12:31:51 -06:00
Kent Overstreet	749b61dab3	bcache: remove driver private bio splitting code The bcache driver has always accepted arbitrarily large bios and split them internally. Now that every driver must accept arbitrarily large bios this code isn't nessecary anymore. Cc: linux-bcache@vger.kernel.org Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: add more description in commit message] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-08-13 12:31:40 -06:00
Kent Overstreet	54efd50bfd	block: make generic_make_request handle arbitrarily sized bios The way the block layer is currently written, it goes to great lengths to avoid having to split bios; upper layer code (such as bio_add_page()) checks what the underlying device can handle and tries to always create bios that don't need to be split. But this approach becomes unwieldy and eventually breaks down with stacked devices and devices with dynamic limits, and it adds a lot of complexity. If the block layer could split bios as needed, we could eliminate a lot of complexity elsewhere - particularly in stacked drivers. Code that creates bios can then create whatever size bios are convenient, and more importantly stacked drivers don't have to deal with both their own bio size limitations and the limitations of the (potentially multiple) devices underneath them. In the future this will let us delete merge_bvec_fn and a bunch of other code. We do this by adding calls to blk_queue_split() to the various make_request functions that need it - a few can already handle arbitrary size bios. Note that we add the call _after_ any call to blk_queue_bounce(); this means that blk_queue_split() and blk_recalc_rq_segments() don't need to be concerned with bouncing affecting segment merging. Some make_request_fn() callbacks were simple enough to audit and verify they don't need blk_queue_split() calls. The skipped ones are: * nfhd_make_request (arch/m68k/emu/nfblock.c) * axon_ram_make_request (arch/powerpc/sysdev/axonram.c) * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c) * brd_make_request (ramdisk - drivers/block/brd.c) * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c) * loop_make_request * null_queue_bio * bcache's make_request fns Some others are almost certainly safe to remove now, but will be left for future patches. Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ming Lei <ming.lei@canonical.com> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: drbd-user@lists.linbit.com Cc: Jiri Kosina <jkosina@suse.cz> Cc: Geoff Levand <geoff@infradead.org> Cc: Jim Paris <jim@jtan.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Acked-by: NeilBrown <neilb@suse.de> (for the 'md/md.c' bits) Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: skip more mq-based drivers, resolve merge conflicts, etc.] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-08-13 12:31:33 -06:00
Mikulas Patocka	76c44f6d80	dm snapshot: don't invalidate on-disk image on snapshot write overflow When the snapshot overflows because of a write to the origin, the on-disk image has to be invalidated. However, when the snapshot overflows because of a write to the snapshot, the on-disk image doesn't have to be invalidated. Change the behavior so that the on-disk image is not invalidated in this case. When the snapshot overflows, the variable snapshot_overflowed is set. All writes to the snapshot are disallowed to minimize filesystem corruption - this condition is cleared when the snapshot is deactivated and activated. The user can extend the overflowed snapshot, deactivate and activate it again, run fsck (if journaling filesystem is not used) mount it and recover the data. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-08-12 16:22:55 -04:00
viresh kumar	fc0a446152	dm: remove unlikely() before IS_ERR() IS_ERR() already contains an 'unlikely' compiler flag so there is no need to do that again from IS_ERR() callers. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-08-12 11:32:21 -04:00
Vivek Goyal	e80d1c805a	dm: do not override error code returned from dm_get_device() Some of the device mapper targets override the error code returned by dm_get_device() and return either -EINVAL or -ENXIO. There is nothing gained by this override. It is better to propagate the returned error code unchanged to caller. This work was motivated by hitting an issue where the underlying device was busy but -EINVAL was being returned. After this change we get -EBUSY instead and it is easier to figure out the problem. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-08-12 11:32:21 -04:00
Mikulas Patocka	ab37844d61	dm: test return value for DM_MAPIO_SUBMITTED In properly written code we should not assume that DM_MAPIO_SUBMITTED is zero. We should test the return value for DM_MAPIO_SUBMITTED rather than testing it for zero. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-08-12 11:32:20 -04:00
Sami Tolvanen	4fb9aa5abd	dm verity: remove unused mempool Since commit `003b5c571` ("block: Convert drivers to immutable biovecs"), vec_mempool in struct dm_verity is no longer used. Remove it and related definitions. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-08-12 11:32:20 -04:00
Joe Thornber	e44b6a5a3c	dm cache: move wake_waker() from free_migrations() to where it is needed This stops spurious wake ups from calls to prealloc_free_structs(). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-08-12 11:32:20 -04:00
Vivek Goyal	8c747fd0c3	dm btree remove: remove unused function get_nr_entries() rebalance_children() calls get_nr_entries() and assigns the result to an unused local 'child_entries' variable. Remove get_nr_entries() and cleanup rebalance_children() accordingly. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-08-12 11:32:19 -04:00
Vivek Goyal	0a8d4c3ef8	dm btree: remove unused "dm_block_t root" parameter in btree_split_sibling() Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-08-12 11:32:19 -04:00
Joe Thornber	4051aab762	dm cache policy smq: change the mutex to a spinlock We no longer sleep in any of the smq functions, so this can become a spinlock. Switching from mutex to spinlock improves performance when the fast cache device is a very low latency device (e.g. NVMe SSD). The switch to spinlock also allows for removal of the extra tick_lock; which is no longer needed since the main lock being a spinlock now fulfills the locking requirements needed by interrupt context. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-08-12 11:32:19 -04:00
Yi Zhang	34dd051741	dm cache policy smq: move 'dm-cache-default' module alias to SMQ When creating dm-cache with the default policy, it will call request_module("dm-cache-default") to register the default policy. But the "dm-cache-default" alias was left referring to the MQ policy. Fix this by moving the module alias to SMQ. Fixes: `bccab6a0` (dm cache: switch the "default" cache replacement policy from mq to smq) Signed-off-by: Yi Zhang <yizhan@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-08-12 11:27:29 -04:00
Joe Thornber	b0dc3c8bc1	dm btree: add ref counting ops for the leaves of top level btrees When using nested btrees, the top leaves of the top levels contain block addresses for the root of the next tree down. If we shadow a shared leaf node the leaf values (sub tree roots) should be incremented accordingly. This is only an issue if there is metadata sharing in the top levels. Which only occurs if metadata snapshots are being used (as is possible with dm-thinp). And could result in a block from the thinp metadata snap being reused early, thus corrupting the thinp metadata snap. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-08-12 10:50:37 -04:00
Joe Thornber	7f518ad0a2	dm thin metadata: delete btrees when releasing metadata snapshot The device details and mapping trees were just being decremented before. Now btree_del() is called to do a deep delete. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-08-12 10:42:51 -04:00
Sasha Levin	9b81c84235	block: don't access bio->bi_error after bio_put() Commit `4246a0b6` ("block: add a bi_error field to struct bio") has added a few dereferences of 'bio' after a call to bio_put(). This causes use-after-frees such as: [521120.719695] BUG: KASan: use after free in dio_bio_complete+0x2b3/0x320 at addr ffff880f36b38714 [521120.720638] Read of size 4 by task mount.ocfs2/9644 [521120.721212] ============================================================================= [521120.722056] BUG kmalloc-256 (Not tainted): kasan: bad access detected [521120.722968] ----------------------------------------------------------------------------- [521120.722968] [521120.723915] Disabling lock debugging due to kernel taint [521120.724539] INFO: Slab 0xffffea003cdace00 objects=32 used=25 fp=0xffff880f36b38600 flags=0x46fffff80004080 [521120.726037] INFO: Object 0xffff880f36b38700 @offset=1792 fp=0xffff880f36b38800 [521120.726037] [521120.726974] Bytes b4 ffff880f36b386f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [521120.727898] Object ffff880f36b38700: 00 88 b3 36 0f 88 ff ff 00 00 d8 de 0b 88 ff ff ...6............ [521120.728822] Object ffff880f36b38710: 02 00 00 f0 00 00 00 00 00 00 00 00 00 00 00 00 ................ [521120.729705] Object ffff880f36b38720: 01 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 ................ [521120.730623] Object ffff880f36b38730: 00 00 00 00 00 00 00 00 01 00 00 00 00 02 00 00 ................ [521120.731621] Object ffff880f36b38740: 00 02 00 00 01 00 00 00 d0 f7 87 ad ff ff ff ff ................ [521120.732776] Object ffff880f36b38750: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [521120.733640] Object ffff880f36b38760: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [521120.734508] Object ffff880f36b38770: 01 00 03 00 01 00 00 00 88 87 b3 36 0f 88 ff ff ...........6.... [521120.735385] Object ffff880f36b38780: 00 73 22 ad 02 88 ff ff 40 13 e0 3c 00 ea ff ff .s".....@..<.... [521120.736667] Object ffff880f36b38790: 00 02 00 00 00 04 00 00 00 00 00 00 00 00 00 00 ................ [521120.737596] Object ffff880f36b387a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [521120.738524] Object ffff880f36b387b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [521120.739388] Object ffff880f36b387c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [521120.740277] Object ffff880f36b387d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [521120.741187] Object ffff880f36b387e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [521120.742233] Object ffff880f36b387f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [521120.743229] CPU: 41 PID: 9644 Comm: mount.ocfs2 Tainted: G B 4.2.0-rc6-next-20150810-sasha-00039-gf909086 #2420 [521120.744274] ffff880f36b38000 ffff880d89c8f638 ffffffffb6e9ba8a ffff880101c0e5c0 [521120.745025] ffff880d89c8f668 ffffffffad76a313 ffff880101c0e5c0 ffffea003cdace00 [521120.745908] ffff880f36b38700 ffff880f36b38798 ffff880d89c8f690 ffffffffad772854 [521120.747063] Call Trace: [521120.747520] dump_stack (lib/dump_stack.c:52) [521120.748053] print_trailer (mm/slub.c:653) [521120.748582] object_err (mm/slub.c:660) [521120.749079] kasan_report_error (include/linux/kasan.h:20 mm/kasan/report.c:152 mm/kasan/report.c:194) [521120.750834] __asan_report_load4_noabort (mm/kasan/report.c:250) [521120.753580] dio_bio_complete (fs/direct-io.c:478) [521120.755752] do_blockdev_direct_IO (fs/direct-io.c:494 fs/direct-io.c:1291) [521120.759765] __blockdev_direct_IO (fs/direct-io.c:1322) [521120.761658] blkdev_direct_IO (fs/block_dev.c:162) [521120.762993] generic_file_read_iter (mm/filemap.c:1738) [521120.767405] blkdev_read_iter (fs/block_dev.c:1649) [521120.768556] __vfs_read (fs/read_write.c:423 fs/read_write.c:434) [521120.772126] vfs_read (fs/read_write.c:454) [521120.773118] SyS_pread64 (fs/read_write.c:607 fs/read_write.c:594) [521120.776062] entry_SYSCALL_64_fastpath (arch/x86/entry/entry_64.S:186) [521120.777375] Memory state around the buggy address: [521120.778118] ffff880f36b38600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [521120.779211] ffff880f36b38680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [521120.780315] >ffff880f36b38700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [521120.781465] ^ [521120.782083] ffff880f36b38780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [521120.783717] ffff880f36b38800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [521120.784818] ================================================================== This patch fixes a few of those places that I caught while auditing the patch, but the original patch should be audited further for more occurences of this issue since I'm not too familiar with the code. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-08-11 11:34:32 -06:00
Greg Kroah-Hartman	5d44f4b348	Merge 4.2-rc6 into char-misc-next We want the fixes in Linus's tree in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-08-09 16:28:09 -07:00
Linus Torvalds	39171c86f1	- Stable fix for a dm_merge_bvec() regression on 32 bit Fedora systems. - Fix for a 4.2 DM thinp discard regression due to inability to properly delete a range of blocks in a data mapping btree. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJVxNokAAoJEMUj8QotnQNaclUIALSoAHN1B47juyNgEjkDQI8x LodGbto3gLIigsYTrVMH13ohwVaekErsYbe65L8wQT5vyNcXbHTs6YRyCfVAdEhb 4/hmGkVaRtitnHwxrzUv+ycCb1Gj/rtSGdt264AiSMAL2cD/ucL1aLYhkiEdGp9k hv5qKxjFyzSK0NVAuKn5B8tNtNGVMQv4faTiWKs5JVA4a8RERD2B4+nDROJY3SwM zgHUPMsPAUb4uphwa0q+qhhrGZWAspS43fCjVBbjXi9lSeUV0TpeYoG4tV7A8T+o jP9xq4kCLbZ7PzYjMWp6s30+H741Wi1AkBx0NkS7Aoz9kt+SBMlTMFXojPP5Pks= =AkR0 -----END PGP SIGNATURE----- Merge tag 'dm-4.2-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - stable fix for a dm_merge_bvec() regression on 32 bit Fedora systems. - fix for a 4.2 DM thinp discard regression due to inability to properly delete a range of blocks in a data mapping btree. * tag 'dm-4.2-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm btree remove: fix bug in remove_one() dm: fix dm_merge_bvec regression on 32 bit systems	2015-08-08 04:35:14 +03:00
Joe Thornber	aa0cd28d05	dm btree remove: fix bug in remove_one() remove_one() was not incrementing the key for the beginning of the range, so not all entries were being removed. This resulted in discards that were not unmapping all blocks. Fixes: `4ec331c3ea` ("dm btree: add dm_btree_remove_leaves()") Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-08-07 11:56:43 -04:00
Geert Uytterhoeven	57d4248731	dm: Spelling s/consitent/consistent/ Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Jiri Kosina <jkosina@suse.com>	2015-08-07 14:36:43 +02:00
Greg Kroah-Hartman	f368ed6088	char: make misc_deregister a void function With well over 200+ users of this api, there are a mere 12 users that actually checked the return value of this function. And all of them really didn't do anything with that information as the system or module was shutting down no matter what. So stop pretending like it matters, and just return void from misc_deregister(). If something goes wrong in the call, you will get a WARNING splat in the syslog so you know how to fix up your driver. Other than that, there's nothing that can go wrong. Cc: Alasdair Kergon <agk@redhat.com> Cc: Neil Brown <neilb@suse.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Wim Van Sebroeck <wim@iguana.be> Cc: Christine Caulfield <ccaulfie@redhat.com> Cc: David Teigland <teigland@redhat.com> Cc: Mark Fasheh <mfasheh@suse.com> Acked-by: Joel Becker <jlbec@evilplan.org> Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com> Acked-by: Alessandro Zummo <a.zummo@towertech.it> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-08-05 10:35:49 -07:00
Mike Snitzer	bd4aaf8f9b	dm: fix dm_merge_bvec regression on 32 bit systems A DM regression on 32 bit systems was reported against v4.2-rc3 here: https://lkml.org/lkml/2015/7/29/401 Fix this by reverting both commit `1c220c69` ("dm: fix casting bug in dm_merge_bvec()") and `148e51ba` ("dm: improve documentation and code clarity in dm_merge_bvec"). This combined revert is done to eliminate the possibility of a partial revert in stable@ kernels. In hindsight the correct fix, at the time `1c220c69` was applied to fix the regression that `148e51ba` introduced, should've been to simply revert `148e51ba`. Reported-by: Josh Boyer <jwboyer@fedoraproject.org> Tested-by: Adam Williamson <awilliam@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.19+	2015-08-03 22:49:59 -04:00
NeilBrown	199dc6ed51	md/raid0: update queue parameter in a safer location. When a (e.g.) RAID5 array is reshaped to RAID0, the updating of queue parameters (e.g. max number of sectors per bio) is done in the wrong place. It should be part of ->run, but it is actually part of ->takeover. This means it happens before level_store() calls: blk_set_stacking_limits(&mddev->queue->limits); and so it ineffective. This can lead to errors from underlying devices. So move all the relevant settings out of create_stripe_zones() and into raid0_run(). As this can lead to a bug-on it is suitable for any -stable kernel which supports reshape to RAID0. So 2.6.35 or later. As the bug has been present for five years there is no urgency, so no need to rush into -stable. Fixes: `9af204cf72` ("md: Add support for Raid5->Raid0 and Raid10->Raid0 takeover") Cc: stable@vger.kernel.org (v2.6.35+ - please delay until after -final release). Reported-by: Yi Zhang <yizhan@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-03 17:12:44 +10:00
Benjamin Randazzo	25eafe1a81	md: simplify get_bitmap_file now that "file" is zeroed. There is no point assigning '\0' to file->pathname[0] as file is now zeroed out, so remove that branch and simplify the code. [Original patch combined this with the change to use kzalloc. I split the two so that the change to kzalloc is easier to backport. - neilb] Signed-off-by: Benjamin Randazzo <benjamin@randazzo.fr> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-03 17:12:44 +10:00
NeilBrown	49895bcc7e	md/raid5: don't let shrink_slab shrink too far. I have a report of drop_one_stripe() called from raid5_cache_scan() apparently finding ->max_nr_stripes == 0. This should not be allowed. So add a test to keep max_nr_stripes above min_nr_stripes. Also use a 'mask' rather than a 'mod' in drop_one_stripe to ensure 'hash' is valid even if max_nr_stripes does reach zero. Fixes: `edbe83ab4c` ("md/raid5: allow the stripe_cache to grow and shrink.") Cc: stable@vger.kernel.org (4.1 - please release with `2d5b569b66`) Reported-by: Tomas Papan <tomas.papan@gmail.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-03 17:10:56 +10:00
Benjamin Randazzo	b6878d9e03	md: use kzalloc() when bitmap is disabled In drivers/md/md.c get_bitmap_file() uses kmalloc() for creating a mdu_bitmap_file_t called "file". 5769 file = kmalloc(sizeof(file), GFP_NOIO); 5770 if (!file) 5771 return -ENOMEM; This structure is copied to user space at the end of the function. 5786 if (err == 0 && 5787 copy_to_user(arg, file, sizeof(file))) 5788 err = -EFAULT But if bitmap is disabled only the first byte of "file" is initialized with zero, so it's possible to read some bytes (up to 4095) of kernel space memory from user space. This is an information leak. 5775 /* bitmap disabled, zero the first byte and copy out */ 5776 if (!mddev->bitmap_info.file) 5777 file->pathname[0] = '\0'; Signed-off-by: Benjamin Randazzo <benjamin@randazzo.fr> Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-03 14:56:02 +10:00
NeilBrown	423f04d63c	md/raid1: extend spinlock to protect raid1_end_read_request against inconsistencies raid1_end_read_request() assumes that the In_sync bits are consistent with the ->degaded count. raid1_spare_active updates the In_sync bit before the ->degraded count and so exposes an inconsistency, as does error() So extend the spinlock in raid1_spare_active() and error() to hide those inconsistencies. This should probably be part of Commit: `34cab6f420` ("md/raid1: fix test for 'was read error from last working device'.") as it addresses the same issue. It fixes the same bug and should go to -stable for same reasons. Fixes: `76073054c9` ("md/raid1: clean up read_balance.") Cc: stable@vger.kernel.org (v3.0+) Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-03 12:29:42 +10:00
Mike Snitzer	795e633a2d	dm cache: fix device destroy hang due to improper prealloc_used accounting Commit `665022d72f` ("dm cache: avoid calls to prealloc_free_structs() if possible") introduced a regression that caused the removal of a DM cache device to hang in cache_postsuspend()'s call to wait_for_migrations() with the following stack trace: [<ffffffff81651457>] schedule+0x37/0x80 [<ffffffffa041e21b>] cache_postsuspend+0xbb/0x470 [dm_cache] [<ffffffff810ba970>] ? prepare_to_wait_event+0xf0/0xf0 [<ffffffffa0006f77>] dm_table_postsuspend_targets+0x47/0x60 [dm_mod] [<ffffffffa0001eb5>] __dm_destroy+0x215/0x250 [dm_mod] [<ffffffffa0004113>] dm_destroy+0x13/0x20 [dm_mod] [<ffffffffa00098cd>] dev_remove+0x10d/0x170 [dm_mod] [<ffffffffa00097c0>] ? dev_suspend+0x240/0x240 [dm_mod] [<ffffffffa0009f85>] ctl_ioctl+0x255/0x4d0 [dm_mod] [<ffffffff8127ac00>] ? SYSC_semtimedop+0x280/0xe10 [<ffffffffa000a213>] dm_ctl_ioctl+0x13/0x20 [dm_mod] [<ffffffff811fd432>] do_vfs_ioctl+0x2d2/0x4b0 [<ffffffff81117d5f>] ? __audit_syscall_entry+0xaf/0x100 [<ffffffff81022636>] ? do_audit_syscall_entry+0x66/0x70 [<ffffffff811fd689>] SyS_ioctl+0x79/0x90 [<ffffffff81023e58>] ? syscall_trace_leave+0xb8/0x110 [<ffffffff81654f6e>] entry_SYSCALL_64_fastpath+0x12/0x71 Fix this by accounting for the call to prealloc_data_structs() immediately _before_ the call as opposed to after. This is needed because it is possible to break out of the control loop after the call to prealloc_data_structs() but before prealloc_used was set to true. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-07-29 14:32:09 -04:00
Mike Snitzer	3508e6590d	Revert "dm cache: do not wake_worker() in free_migration()" This reverts commit `386cb7cdee`. Taking the wake_worker() out of free_migration() will slow writeback dramatically, and hence adaptability. Say we have 10k blocks that need writing back, but are only able to issue 5 concurrently due to the migration bandwidth: it's imperative that we wake_worker() immediately after migration completion; waiting for the next 1 second wake up (via do_waker) means it'll take a long time to write that all back. Reported-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-07-29 14:32:08 -04:00
Jens Axboe	b7c44ed9d2	block: manipulate bio->bi_flags through helpers Some places use helpers now, others don't. We only have the 'is set' helper, add helpers for setting and clearing flags too. It was a bit of a mess of atomic vs non-atomic access. With BIO_UPTODATE gone, we don't have any risk of concurrent access to the flags. So relax the restriction and don't make any of them atomic. The flags that do have serialization issues (reffed and chained), we already handle those separately. Signed-off-by: Jens Axboe <axboe@fb.com>	2015-07-29 08:55:20 -06:00
Christoph Hellwig	4246a0b63b	block: add a bi_error field to struct bio Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-07-29 08:55:15 -06:00
Baruch Siach	6ed443c07f	dm crypt: update wiki page URL Cryptsetup moved to gitlab. This is a leftover from commit `e44f23b32d` (dm crypt: update URLs to new cryptsetup project page, 2015-04-05). Signed-off-by: Baruch Siach <baruch@tkos.co.il> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-07-27 07:58:16 -04:00
Colin Ian King	134bf30c06	dm cache policy smq: fix alloc_bitset check that always evaluates as false static analysis by cppcheck has found a check on alloc_bitset that always evaluates as false and hence never finds an allocation failure: [drivers/md/dm-cache-policy-smq.c:1689]: (warning) Logical conjunction always evaluates to false: !EXPR && EXPR. Fix this by removing the incorrect mq->cache_hit_bits check Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-07-27 07:58:15 -04:00
Mike Snitzer	0a927c2f02	dm thin: return -ENOSPC when erroring retry list due to out of data space Otherwise -EIO would be returned when -ENOSPC should be used consistently. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-07-26 17:39:19 -04:00
Linus Torvalds	aca105a697	Some md fixes for 4.2 Several are tagged for -stable. A few aren't because they are not very, serious or because they are in the 'experimental' cluster code. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJVsbRgAAoJEDnsnt1WYoG54BcQAJlVxcGdMXNAbFAfhToH+cwY DcCKmnXiu3TCcK/gtr0SlUKIhv7kPE2HaGbTko3g5/uTk7SCuMSWRbjpQJr1u98U 9VUPZV0RGLEcQXgsjG3sobEtdSYMSn1/BpdJpmsn/Q6cyTheiEJA9fEghn9F0Iw9 3Ctc56aL0nKsnpeRTvWmKR3F995L2Ene+pIHPbSqTlQcGU+DdxQL/iY9YspMzeih 6SWQICo59+pGSRQdKaU968nZ1a8hJCvhV2NbW0JsJMo9iGo5htCJLdxZatscZV6E xAppCRymW/Nl0n59wCOBhUp+0skVhtYZ1UmNdx/vUA7vcZJX85vIG7WGaaSQAmlF Y4nNMSaX6SWbt/oVgq+JiD39T1oFZN5N5Fh9juqcp3fshiWLO74cnTwea1LtkQZX LfW2HhajDUgoJqoXL6ppKDK8l7alQdcaoYn0C6SfqRZ+PLB5ERrSBHNZpKDbI9aw CFW93lHTtlwtwZn0S7k06mCG8ilO4iPthUVaJEeI1kUWVKY0Ju4xnVTt/7jCroUa Km14KdWeZsS8j9/xr6FYBjarjuh7M9SoflgUK84txE+PIZsSczyrErpBkMf2YBWQ 0KduqQabXa1XYWGtZ7Uhj6iOOGNBcTcZsww8cBEDOJEDyCtSs5w/iPmsFCEGUuo2 YTz6qKpYPgB1VzK2PoBG =6ZCK -----END PGP SIGNATURE----- Merge tag 'md/4.2-fixes' of git://neil.brown.name/md Pull md fixes from Neil Brown: "Some md fixes for 4.2 Several are tagged for -stable. A few aren't because they are not very, serious or because they are in the 'experimental' cluster code" * tag 'md/4.2-fixes' of git://neil.brown.name/md: md/raid5: clear R5_NeedReplace when no longer needed. Fix read-balancing during node failure md-cluster: fix bitmap sub-offset in bitmap_read_sb md: Return error if request_module fails and returns positive value md: Skip cluster setup in case of error while reading bitmap md/raid1: fix test for 'was read error from last working device'. md: Skip cluster setup for dm-raid md: flush ->event_work before stopping array. md/raid10: always set reshape_safe when initializing reshape_position. md/raid5: avoid races when changing cache size.	2015-07-25 11:24:58 -07:00
NeilBrown	e6030cb06c	md/raid5: clear R5_NeedReplace when no longer needed. This flag is currently never cleared, which can in rare cases trigger a warn-on if it is still set but the block isn't InSync. So clear it when it isn't need, which includes if the replacement device has failed. Signed-off-by: NeilBrown <neilb@suse.com>	2015-07-24 13:38:04 +10:00
Goldwyn Rodrigues	90382ed9af	Fix read-balancing during node failure During a node failure, We need to suspend read balancing so that the reads are directed to the first device and stale data is not read. Suspending writes is not required because these would be recorded and synced eventually. A new flag MD_CLUSTER_SUSPEND_READ_BALANCING is set in recover_prep(). area_resyncing() will respond true for the entire devices if this flag is set and the request type is READ. The flag is cleared in recover_done(). Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reported-By: David Teigland <teigland@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-07-24 13:37:59 +10:00
Goldwyn Rodrigues	33e38ac688	md-cluster: fix bitmap sub-offset in bitmap_read_sb bitmap_read_sb is modifying mddev->bitmap_info.offset. This works for the first bitmap read. However, when multiple bitmaps need to be opened by the same node, it ends up corrupting the offset. Fix it by using a local variable. Also, bitmap_read_sb is not required in bitmap_copy_from_slot since it is called in bitmap_create. Remove bitmap_read_sb(). Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-07-24 13:37:55 +10:00
Goldwyn Rodrigues	b0c26a79d6	md: Return error if request_module fails and returns positive value request_module() can return 256 (process exited) in some cases, which is not as specified in the documentation before the request_module() definition. Convert the error to -ENOENT. The positive error number results in bitmap_create() returning a value that is meant to be an error but doesn't look like one, so it is dereferenced as a point and causes a crash. (not needed for stable as this is "experimental" code) Fixes: `edb39c9ded` ("Introduce md_cluster_operations to handle cluster functions") Signed-off-By: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-07-24 13:37:51 +10:00
Goldwyn Rodrigues	f735727319	md: Skip cluster setup in case of error while reading bitmap If the bitmap read fails, the error code set is -EINVAL. However, we don't check for errors and go ahead with cluster_setup. Skip the cluster setup in case of error. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-07-24 13:37:48 +10:00
NeilBrown	34cab6f420	md/raid1: fix test for 'was read error from last working device'. When we get a read error from the last working device, we don't try to repair it, and don't fail the device. We simple report a read error to the caller. However the current test for 'is this the last working device' is wrong. When there is only one fully working device, it assumes that a non-faulty device is that device. However a spare which is rebuilding would be non-faulty but so not the only working device. So change the test from "!Faulty" to "In_sync". If ->degraded says there is only one fully working device and this device is in_sync, this must be the one. This bug has existed since we allowed read_balance to read from a recovering spare in v3.0 Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com> Fixes: `76073054c9` ("md/raid1: clean up read_balance.") Cc: stable@vger.kernel.org (v3.0+) Signed-off-by: NeilBrown <neilb@suse.com>	2015-07-24 13:37:21 +10:00
Goldwyn Rodrigues	d3b178adb3	md: Skip cluster setup for dm-raid There is a bug that the bitmap superblock isn't initialised properly for dm-raid, so a new field can have garbage in new fields. (dm-raid does initialisation in the kernel - md initialised the superblock in mdadm). This means that for dm-raid we cannot currently trust the new ->nodes field. So: - use __GFP_ZERO to initialise the superblock properly for all new arrays - initialise all fields in bitmap_info in bitmap_new_disk_sb - ignore ->nodes for dm arrays (yes, this is a hack) This bug exposes dm-raid to bug in the (still experimental) md-cluster code, so it is suitable for -stable. It does cause crashes. References: https://bugzilla.kernel.org/show_bug.cgi?id=100491 Cc: stable@vger.kernel.org (v4.1) Signed-off-By: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-07-23 09:22:00 +10:00
NeilBrown	ee5d004fd0	md: flush ->event_work before stopping array. The 'event_work' worker used by dm-raid may still be running when the array is stopped. This can result in an oops. So flush the workqueue on which it is run after detaching and before destroying the device. Reported-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Cc: stable@vger.kernel.org (2.6.38+ please delay 2 weeks after -final release) Fixes: `9d09e663d5` ("dm: raid456 basic support")	2015-07-22 14:09:29 +10:00
NeilBrown	299b0685e3	md/raid10: always set reshape_safe when initializing reshape_position. 'reshape_position' tracks where in the reshape we have reached. 'reshape_safe' tracks where in the reshape we have safely recorded in the metadata. These are compared to determine when to update the metadata. So it is important that reshape_safe is initialised properly. Currently it isn't. When starting a reshape from the beginning it usually has the correct value by luck. But when reducing the number of devices in a RAID10, it has the wrong value and this leads to the metadata not being updated correctly. This can lead to corruption if the reshape is not allowed to complete. This patch is suitable for any -stable kernel which supports RAID10 reshape, which is 3.5 and later. Fixes: `3ea7daa5d7` ("md/raid10: add reshape support") Cc: stable@vger.kernel.org (v3.5+ please wait for -final to be out for 2 weeks) Signed-off-by: NeilBrown <neilb@suse.com>	2015-07-22 14:08:24 +10:00
NeilBrown	2d5b569b66	md/raid5: avoid races when changing cache size. Cache size can grow or shrink due to various pressures at any time. So when we resize the cache as part of a 'grow' operation (i.e. change the size to allow more devices) we need to blocks that automatic growing/shrinking. So introduce a mutex. auto grow/shrink uses mutex_trylock() and just doesn't bother if there is a blockage. Resizing the whole cache holds the mutex to ensure that the correct number of new stripes is allocated. This bug can result in some stripes not being freed when an array is stopped. This leads to the kmem_cache not being freed and a subsequent array can try to use the same kmem_cache and get confused. Fixes: `edbe83ab4c` ("md/raid5: allow the stripe_cache to grow and shrink.") Cc: stable@vger.kernel.org (4.1 - please delay until 2 weeks after release of 4.2) Signed-off-by: NeilBrown <neilb@suse.com>	2015-07-22 14:04:15 +10:00
Linus Torvalds	3f8476fe89	- Revert a request-based DM core change that caused IO latency to increase and adversely impact both throughput and system load - Fix for a use after free bug in DM core's device cleanup - A couple DM btree removal fixes (used by dm-thinp) - A DM thinp fix for order-5 allocation failure - A DM thinp fix to not degrade to read-only metadata mode when in out-of-data-space mode for longer than the 'no_space_timeout' - Fix a long-standing oversight in both dm-thinp and dm-cache by now exporting 'needs_check' in status if it was set in metadata - Fix an embarrassing dm-cache busy-loop that caused worker threads to eat cpu even if no IO was actively being issued to the cache device -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJVqUfSAAoJEMUj8QotnQNa+RUH+wXrHCGI6J7RHIXVd5igP9K0 yFZGEnLZe6Ebt5CACLcKn/qN0g97iwCrlcxFt+1Gj/GbW1GIQzs7vg38La3PZxWZ jAkI3JMY816bP1x3VK1HtMsk2gRaE/hh0gxK5pPLB9a+ZdEsz9UML0rs+JseOdn3 n+454dhwOyChwz7zFEbpn+mfjoruFScGX0Y2qaSHBV/xMhmExpthw9V1yFC2v2tW 8cAHOMDLNLHhR5adF9YxjZH8wILbyYK9oPy3iGhj/TF/Dx7saWYG4UlnL5xIOLsB 5WK9gRrJJ/Wf0FsDdN88AaY4Bdpj4esS2JeTZpvujxeBb7ZNeJoCUqyzggURv/c= =hCjo -----END PGP SIGNATURE----- Merge tag 'dm-4.2-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - revert a request-based DM core change that caused IO latency to increase and adversely impact both throughput and system load - fix for a use after free bug in DM core's device cleanup - a couple DM btree removal fixes (used by dm-thinp) - a DM thinp fix for order-5 allocation failure - a DM thinp fix to not degrade to read-only metadata mode when in out-of-data-space mode for longer than the 'no_space_timeout' - fix a long-standing oversight in both dm-thinp and dm-cache by now exporting 'needs_check' in status if it was set in metadata - fix an embarrassing dm-cache busy-loop that caused worker threads to eat cpu even if no IO was actively being issued to the cache device * tag 'dm-4.2-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm cache: avoid calls to prealloc_free_structs() if possible dm cache: avoid preallocation if no work in writeback_some_dirty_blocks() dm cache: do not wake_worker() in free_migration() dm cache: display 'needs_check' in status if it is set dm thin: display 'needs_check' in status if it is set dm thin: stay in out-of-data-space mode once no_space_timeout expires dm: fix use after free crash due to incorrect cleanup sequence Revert "dm: only run the queue on completion if congested or no requests pending" dm btree: silence lockdep lock inversion in dm_btree_del() dm thin: allocate the cell_sort_array dynamically dm btree remove: fix bug in redistribute3	2015-07-17 20:53:57 -07:00
Jens Axboe	2bb4cd5cc4	block: have drivers use blk_queue_max_discard_sectors() Some drivers use it now, others just set the limits field manually. But in preparation for splitting this into a hard and soft limit, ensure that they all call the proper function for setting the hw limit for discards. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-07-17 08:41:53 -06:00
Mike Snitzer	665022d72f	dm cache: avoid calls to prealloc_free_structs() if possible If no work was performed then prealloc_data_structs() wasn't ever called so there isn't any need to call prealloc_free_structs(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-07-16 22:32:08 -04:00
Mike Snitzer	e782eff591	dm cache: avoid preallocation if no work in writeback_some_dirty_blocks() Refactor writeback_some_dirty_blocks() to avoid prealloc_data_structs() if the policy doesn't have any dirty blocks ready for writeback. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-07-16 22:32:07 -04:00
Mike Snitzer	386cb7cdee	dm cache: do not wake_worker() in free_migration() All methods that queue work call wake_worker() as you'd expect. E.g. cell_defer, defer_bio, quiesce_migration (which is called by writeback, promote, demote_then_promote, invalidate, discard, etc). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-07-16 22:32:06 -04:00
Mike Snitzer	255eac2005	dm cache: display 'needs_check' in status if it is set There is currently no way to see that the needs_check flag has been set in the metadata. Display 'needs_check' in the cache status if it is set in the cache metadata. Also, update cache documentation. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-07-16 10:23:50 -04:00
Mike Snitzer	e4c78e210d	dm thin: display 'needs_check' in status if it is set There is currently no way to see that the needs_check flag has been set in the metadata. Display 'needs_check' in the thin-pool status if it is set in the thinp metadata. Also, update thinp documentation. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-07-16 10:23:50 -04:00
Mike Snitzer	bcc696fac1	dm thin: stay in out-of-data-space mode once no_space_timeout expires This fixes an issue where running out of data space would cause the thin-pool's metadata to become read-only. There was no reason to make metadata read-only -- calling set_pool_mode() with PM_READ_ONLY was a misguided way to error all queued and future write IOs. We can accomplish the same by degrading from PM_OUT_OF_DATA_SPACE to PM_OUT_OF_DATA_SPACE with error_if_no_space enabled. Otherwise, the use of PM_READ_ONLY could cause a race where commit() was started before the PM_READ_ONLY transition but dm_pool_commit_metadata() would go on to fail because the block manager had transitioned to read-only. The return of -EPERM from dm_pool_commit_metadata(), due to attempting to commit while in read-only mode, caused the thin-pool to set 'needs_check' because a metadata_operation_failed(). This needless cascade of failures makes life for users more difficult than needed. Reported-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-07-16 10:23:49 -04:00
Mikulas Patocka	b06075a98d	dm: fix use after free crash due to incorrect cleanup sequence Linux 4.2-rc1 Commit `0f20972f7b` ("dm: factor out a common cleanup_mapped_device()") moved a common cleanup code to a separate function. Unfortunately, that commit incorrectly changed the order of cleanup, so that it destroys the mapped_device's srcu structure 'io_barrier' before destroying its workqueue. The function that is executed on the workqueue (dm_wq_work) uses the srcu structure, thus it may use it after being freed. That results in a crash in the LVM test suite's mirror-vgreduce-removemissing.sh test. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: `0f20972f7b` ("dm: factor out a common cleanup_mapped_device()") Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-07-13 09:14:11 -04:00
Jens Axboe	77b5a08427	bcache: don't embed 'return' statements in closure macros This is horribly confusing, it breaks the flow of the code without it being apparent in the caller. Signed-off-by: Jens Axboe <axboe@fb.com> Acked-by: Christoph Hellwig <hch@lst.de>	2015-07-11 09:57:32 -06:00
Mike Snitzer	621739b00e	Revert "dm: only run the queue on completion if congested or no requests pending" This reverts commit `9a0e609e3f`. (Resolved a conflict during revert due to commit `bfebd1cdb4` that came after) This revert is motivated by a couple failure reports on request-based DM multipath testbeds: 1) Netapp reported that their multipath fault injection test under heavy IO load can stall longer than 300 seconds. 2) IBM reported elevated lock contention in their testbed (likely due to increased back pressure due to IO not being dispatched as quickly): https://www.redhat.com/archives/dm-devel/2015-July/msg00057.html Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.1+	2015-07-08 16:16:07 -04:00
Joe Thornber	1c7518794a	dm btree: silence lockdep lock inversion in dm_btree_del() Allocate memory using GFP_NOIO when deleting a btree. dm_btree_del() can be called via an ioctl and we don't want to recurse into the FS or block layer. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-07-06 10:45:02 -04:00
Joe Thornber	a822c83e47	dm thin: allocate the cell_sort_array dynamically Given the pool's cell_sort_array holds 8192 pointers it triggers an order 5 allocation via kmalloc. This order 5 allocation is prone to failure as system memory gets more fragmented over time. Fix this by allocating the cell_sort_array using vmalloc. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-07-05 21:59:37 -04:00
Dennis Yang	4c7e309340	dm btree remove: fix bug in redistribute3 redistribute3() shares entries out across 3 nodes. Some entries were being moved the wrong way, breaking the ordering. This manifested as a BUG() in dm-btree-remove.c:shift() when entries were removed from the btree. For additional context see: https://www.redhat.com/archives/dm-devel/2015-May/msg00113.html Signed-off-by: Dennis Yang <shinrairis@gmail.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-07-05 17:43:58 -04:00
Linus Torvalds	1dc51b8288	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: "Assorted VFS fixes and related cleanups (IMO the most interesting in that part are f_path-related things and Eric's descriptor-related stuff). UFS regression fixes (it got broken last cycle). 9P fixes. fs-cache series, DAX patches, Jan's file_remove_suid() work" [ I'd say this is much more than "fixes and related cleanups". The file_table locking rule change by Eric Dumazet is a rather big and fundamental update even if the patch isn't huge. - Linus ] * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits) 9p: cope with bogus responses from server in p9_client_{read,write} p9_client_write(): avoid double p9_free_req() 9p: forgetting to cancel request on interrupted zero-copy RPC dax: bdev_direct_access() may sleep block: Add support for DAX reads/writes to block devices dax: Use copy_from_iter_nocache dax: Add block size note to documentation fs/file.c: __fget() and dup2() atomicity rules fs/file.c: don't acquire files->file_lock in fd_install() fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation vfs: avoid creation of inode number 0 in get_next_ino namei: make set_root_rcu() return void make simple_positive() public ufs: use dir_pages instead of ufs_dir_pages() pagemap.h: move dir_pages() over there remove the pointless include of lglock.h fs: cleanup slight list_entry abuse xfs: Correctly lock inode when removing suid and file capabilities fs: Call security_ops->inode_killpriv on truncate fs: Provide function telling whether file_remove_privs() will do anything ...	2015-07-04 19:36:06 -07:00
Joe Perches	d1aa1ab33d	MAINTAINERS: BCACHE: Kent Overstreet has changed email address Kent's email address in MAINTAINERS seems to be invalid. This was his last sign-off address, so use that if appropriate. Fix the S: status entry while there. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-06-30 19:45:01 -07:00
Pekka Enberg	958b43384e	bcache: use kvfree() in various places Use kvfree() instead of open-coding it. Signed-off-by: Pekka Enberg <penberg@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-06-30 19:45:00 -07:00
Linus Torvalds	6aaf0da872	md updates for 4.2 A mixed bag - a few bug fixes - some performance improvement that decrease lock contention - some clean-up Nothing major. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJVi6weAAoJEDnsnt1WYoG50CsP/RqFbZicRSIvzXUURwP+yCP0 3YZuURj4IXC6Cy/HLX+bZoj1p/b+GIRsZ72fWFJrd2LheaAI6WojCCLlnmXUtI/Y LIppF8/A2hfCNbF9cILByvrbzfndeEGK8kvootBDpvD0jlYiGePPAMQY2zx0MAyb T4yJ/KiziLniP6x7vqZrQ6I1MRVjeanN6RWXktFtixMpNOKUJe3PiZbUz4VDIrHR DaiHCbMjvRIkUWgNY8HmijEt+c8AYia7muqLj359dy2xF1hlUIdCx+61cgFD1zd8 enKDH3xp+3B9BEgHe+AtxTAzpqSgU93tdhUjGcy/orA+yYjAAcA4ifngrzfE3VKb kwQgPh2JvUrubavrcto0hthS5RldrCpDXebOM4aEq+7lDHCwrZ39Qio5+1F7TLt5 A5E3Eb7dPRdp9T3LrluX8/f7bO/Wbmxvv/RwnSLTpnGQoBWIAqCpQ+e9ro446Gsx /phXv3tE78fKj88LgQY/mm8ICeCppmQGLrpmjk9bkaZzqFdzQoURVmPh8QPMuJB4 iMHpOOKLzrUlW/23rRxaIKwPuFyxlNuLAvyA3ezsymGiZ+SqSeFCEm1jN64EfMCI 39rpfZt2pcVVOZJ9YeuzZG9wpie96yGZgnVWlP3FPjqRpboXqmtHlYA6EMRtqDAy mjSiGDF2bxkT1/YcjELD =sXTI -----END PGP SIGNATURE----- Merge tag 'md/4.2' of git://neil.brown.name/md Pull md updates from Neil Brown: "A mixed bag - a few bug fixes - some performance improvement that decrease lock contention - some clean-up Nothing major" * tag 'md/4.2' of git://neil.brown.name/md: md: clear Blocked flag on failed devices when array is read-only. md: unlock mddev_lock on an error path. md: clear mddev->private when it has been freed. md: fix a build warning md/raid5: ignore released_stripes check md/raid5: per hash value and exclusive wait_for_stripe md/raid5: split wait_for_stripe and introduce wait_for_quiescent wait: introduce wait_event_exclusive_cmd md: convert to kstrto*() md/raid10: make sync_request_write() call bio_copy_data()	2015-06-29 11:10:56 -07:00
Linus Torvalds	22165fa798	- Revert block and DM core changes the removed request-based DM's ability to handle partial request completions -- otherwise with the current SCSI LLDs these changes could lead to silent data corruption. - Fix two DM version bumps that were missing from the initial 4.2 DM pull request (enabled userspace lvm2 to know certain changes have been made). -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJVjWetAAoJEMUj8QotnQNaEngIAMVwExw0u04jqoW9rUwLDbpr PS2A4lh/MGtMqGGPwJp5qiwnKkgQ5/FcxRpslNQYqA6KrIlnjWJhacWl7tOrwqxn +WBsHIUwjcpwK2RqxSS3Petb6xDd7A3LfTQVhKV9xKZpZp8Y25a+1MPmUYKsFLBH DJ1d9bXPMdN1qjBXBU1rKkVxj6z8iNz/lv24eN0MGyWhfUUTc8lQg3eey3L0BzCc siOuupFQXaWIkbawLZrmvPPNm1iMoABC1OPZCTB1AYZYx1rqzEGUR1nZN+qWf6Wf rZtAPZehbRzvOaf5jC6tEfAcTF23aPEyp4LD+aAQpbuC/1IBi8a3S8z6PvR5EjA= =QY48 -----END PGP SIGNATURE----- Merge tag 'dm-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: "Apologies for not pressing this request-based DM partial completion issue further, it was an oversight on my part. We'll have to get it fixed up properly and revisit for a future release. - Revert block and DM core changes the removed request-based DM's ability to handle partial request completions -- otherwise with the current SCSI LLDs these changes could lead to silent data corruption. - Fix two DM version bumps that were missing from the initial 4.2 DM pull request (enabled userspace lvm2 to know certain changes have been made)" * tag 'dm-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm cache policy smq: fix "default" version to be 1.4.0 dm: bump the ioctl version to 4.32.0 Revert "block, dm: don't copy bios for request clones" Revert "dm: do not allocate any mempools for blk-mq request-based DM"	2015-06-26 12:35:01 -07:00
Linus Torvalds	47a469421d	Merge branch 'akpm' (patches from Andrew) Merge second patchbomb from Andrew Morton: - most of the rest of MM - lots of misc things - procfs updates - printk feature work - updates to get_maintainer, MAINTAINERS, checkpatch - lib/ updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (96 commits) exit,stats: /* obey this comment / coredump: add __printf attribute to cn_printf functions coredump: use from_kuid/kgid when formatting corename fs/reiserfs: remove unneeded cast NILFS2: support NFSv2 export fs/befs/btree.c: remove unneeded initializations fs/minix: remove unneeded cast init/do_mounts.c: add create_dev() failure log kasan: remove duplicate definition of the macro KASAN_FREE_PAGE fs/efs: femove unneeded cast checkpatch: emit "NOTE: <types>" message only once after multiple files checkpatch: emit an error when there's a diff in a changelog checkpatch: validate MODULE_LICENSE content checkpatch: add multi-line handling for PREFER_ETHER_ADDR_COPY checkpatch: suggest using eth_zero_addr() and eth_broadcast_addr() checkpatch: fix processing of MEMSET issues checkpatch: suggest using ether_addr_equal*() checkpatch: avoid NOT_UNIFIED_DIFF errors on cover-letter.patch files checkpatch: remove local from codespell path checkpatch: add --showfile to allow input via pipe to show filenames ...	2015-06-26 09:52:05 -07:00
Mike Snitzer	b5451e4568	dm cache policy smq: fix "default" version to be 1.4.0 Commit `bccab6a0` ("dm cache: switch the "default" cache replacement policy from mq to smq") should've incremented the "default" policy's version number to 1.4.0 rather than reverting to version 1.0.0. Reported-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-26 10:14:28 -04:00
Mike Snitzer	78d8e58a08	Revert "block, dm: don't copy bios for request clones" This reverts commit `5f1b670d0b`. Justification for revert as reported in this dm-devel post: https://www.redhat.com/archives/dm-devel/2015-June/msg00160.html this change should not be pushed to mainline yet. Firstly, Christoph has a newer version of the patch that fixes silent data corruption problem: https://www.redhat.com/archives/dm-devel/2015-May/msg00229.html And the new version still depends on LLDDs to always complete requests to the end when error happens, while block API doesn't enforce such a requirement. If the assumption is ever broken, the inconsistency between request and bio (e.g. rq->__sector and rq->bio) will cause silent data corruption: https://www.redhat.com/archives/dm-devel/2015-June/msg00022.html Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-26 10:11:58 -04:00
Mike Snitzer	4e6e36c371	Revert "dm: do not allocate any mempools for blk-mq request-based DM" This reverts commit `cbc4e3c135`. Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-26 10:11:07 -04:00
Rasmus Villemoes	90a9befb20	drivers/md/md.c: use strreplace() There's no point in starting over when we meet a '/'. This also eliminates a stack variable and a little .text. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-06-25 17:00:40 -07:00
Linus Torvalds	6597ac8a51	- DM core cleanups - blk-mq request-based DM no longer uses any mempools now that partial completions are no longer handled as part of cloned requests - DM raid cleanups and support for MD raid0 - DM cache core advances and a new stochastic-multi-queue (smq) cache replacement policy - smq is the new default dm-cache policy - DM thinp cleanups and much more efficient large discard support - DM statistics support for request-based DM and nanosecond resolution timestamps - Fixes to DM stripe, DM log-writes, DM raid1 and DM crypt -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJViE/tAAoJEMUj8QotnQNag8wIAMhcmy46qGCy6SrOew6D8EQ0 4Ielsr5ZI67Q9pZVI1x/Zdt5GfDUGXiSEy/y8xoIuJfmsRXwnZ/gkOUyWpSUW8dH mgPUUpGF0aN2D/66P3chgm39rVZEt3crDY+SQu02Fm86JKlMUomFbWKgMXpsM6Ga HHW80zLF9Ca6D5m8xMze/ic1KBJtA9zaXcO9xnMCBym8mSaORtgwoCzKAgy43H7U KIBwPKR/y+c7SaKWGxXwxxhTKZDyP4FbgtiJ2Dc9yBDoD0RzbyuY/EFMEUPEbEMv e9fFHNEzmKlsNVY2vYXkVH4TdldWrrjkmYj/JVtz8cifoupWOOnDTjb2aqjAIww= =DKCC -----END PGP SIGNATURE----- Merge tag 'dm-4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - DM core cleanups: * blk-mq request-based DM no longer uses any mempools now that partial completions are no longer handled as part of cloned requests - DM raid cleanups and support for MD raid0 - DM cache core advances and a new stochastic-multi-queue (smq) cache replacement policy * smq is the new default dm-cache policy - DM thinp cleanups and much more efficient large discard support - DM statistics support for request-based DM and nanosecond resolution timestamps - Fixes to DM stripe, DM log-writes, DM raid1 and DM crypt * tag 'dm-4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (39 commits) dm stats: add support for request-based DM devices dm stats: collect and report histogram of IO latencies dm stats: support precise timestamps dm stats: fix divide by zero if 'number_of_areas' arg is zero dm cache: switch the "default" cache replacement policy from mq to smq dm space map metadata: fix occasional leak of a metadata block on resize dm thin metadata: fix a race when entering fail mode dm thin: fail messages with EOPNOTSUPP when pool cannot handle messages dm thin: range discard support dm thin metadata: add dm_thin_remove_range() dm thin metadata: add dm_thin_find_mapped_range() dm btree: add dm_btree_remove_leaves() dm stats: Use kvfree() in dm_kvfree() dm cache: age and write back cache entries even without active IO dm cache: prefix all DMERR and DMINFO messages with cache device name dm cache: add fail io mode and needs_check flag dm cache: wake the worker thread every time we free a migration object dm cache: add stochastic-multi-queue (smq) policy dm cache: boost promotion of blocks that will be overwritten dm cache: defer whole cells ...	2015-06-25 16:34:39 -07:00
Linus Torvalds	e4bc13adfd	Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-block Pull cgroup writeback support from Jens Axboe: "This is the big pull request for adding cgroup writeback support. This code has been in development for a long time, and it has been simmering in for-next for a good chunk of this cycle too. This is one of those problems that has been talked about for at least half a decade, finally there's a solution and code to go with it. Also see last weeks writeup on LWN: http://lwn.net/Articles/648292/" * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits) writeback, blkio: add documentation for cgroup writeback support vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB writeback: do foreign inode detection iff cgroup writeback is enabled v9fs: fix error handling in v9fs_session_init() bdi: fix wrong error return value in cgwb_create() buffer: remove unusued 'ret' variable writeback: disassociate inodes from dying bdi_writebacks writeback: implement foreign cgroup inode bdi_writeback switching writeback: add lockdep annotation to inode_to_wb() writeback: use unlocked_inode_to_wb transaction in inode_congested() writeback: implement unlocked_inode_to_wb transaction and use it for stat updates writeback: implement [locked_]inode_to_wb_and_lock_list() writeback: implement foreign cgroup inode detection writeback: make writeback_control track the inode being written back writeback: relocate wb[_try]_get(), wb_put(), inode_{attach\|detach}_wb() mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use writeback: implement memcg writeback domain based throttling writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes writeback: implement memcg wb_domain writeback: update wb_over_bg_thresh() to use wb_domain aware operations ...	2015-06-25 16:00:17 -07:00
Linus Torvalds	bfffa1cc9d	Merge branch 'for-4.2/core' of git://git.kernel.dk/linux-block Pull core block IO update from Jens Axboe: "Nothing really major in here, mostly a collection of smaller optimizations and cleanups, mixed with various fixes. In more detail, this contains: - Addition of policy specific data to blkcg for block cgroups. From Arianna Avanzini. - Various cleanups around command types from Christoph. - Cleanup of the suspend block I/O path from Christoph. - Plugging updates from Shaohua and Jeff Moyer, for blk-mq. - Eliminating atomic inc/dec of both remaining IO count and reference count in a bio. From me. - Fixes for SG gap and chunk size support for data-less (discards) IO, so we can merge these better. From me. - Small restructuring of blk-mq shared tag support, freeing drivers from iterating hardware queues. From Keith Busch. - A few cfq-iosched tweaks, from Tahsin Erdogan and me. Makes the IOPS mode the default for non-rotational storage" * 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits) cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL cfq-iosched: fix sysfs oops when attempting to read unconfigured weights cfq-iosched: move group scheduling functions under ifdef cfq-iosched: fix the setting of IOPS mode on SSDs blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file block, cgroup: implement policy-specific per-blkcg data block: Make CFQ default to IOPS mode on SSDs block: add blk_set_queue_dying() to blkdev.h blk-mq: Shared tag enhancements block: don't honor chunk sizes for data-less IO block: only honor SG gap prevention for merges that contain data block: fix returnvar.cocci warnings block, dm: don't copy bios for request clones block: remove management of bi_remaining when restoring original bi_end_io block: replace trylock with mutex_lock in blkdev_reread_part() block: export blkdev_reread_part() and __blkdev_reread_part() suspend: simplify block I/O handling block: collapse bio bit space block: remove unused BIO_RW_BLOCK and BIO_EOF flags block: remove BIO_EOPNOTSUPP ...	2015-06-25 14:29:53 -07:00
Neil Brown	ab16bfc732	md: clear Blocked flag on failed devices when array is read-only. The Blocked flag indicates that a device has failed but that this fact hasn't been recorded in the metadata yet. Writes to such devices cannot be allowed until the metadata has been updated. On a read-only array, the Blocked flag will never be cleared. This prevents the device being removed from the array. If the metadata is being handled by the kernel (i.e. !mddev->external), then we can be sure that if the array is switch to writable, then a metadata update will happen and will record the failure. So we don't need the flag set. If metadata is externally managed, it is upto the external manager to clear the 'blocked' flag. Reported-by: XiaoNi <xni@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-06-25 17:16:49 +10:00
NeilBrown	9a8c0fa861	md: unlock mddev_lock on an error path. This error path retuns while still holding the lock - bad. Fixes: `6791875e2e` ("md: make reconfig_mutex optional for writes to md sysfs files.") Cc: stable@vger.kernel.org (v4.0+) Signed-off-by: NeilBrown <neilb@suse.com>	2015-06-25 17:14:09 +10:00
NeilBrown	bd6919228d	md: clear mddev->private when it has been freed. If ->private is set when ->run is called, it is assumed to be a 'config' prepared as part of 'reshape'. So it is important when we free that config, that we also clear ->private. This is not often a problem as the mddev will normally be discarded shortly after the config us freed. However if an 'assemble' races with a final close, the assemble can use the old mddev which has a stale ->private. This leads to any of various sorts of crashes. So clear ->private after calling ->free(). Reported-by: Nate Clark <nate@neworld.us> Cc: stable@vger.kernel.org (v4.0+) Fixes: `afa0f557cb` ("md: rename ->stop to ->free") Signed-off-by: NeilBrown <neilb@suse.com>	2015-06-25 17:14:09 +10:00
Miklos Szeredi	2726d56620	vfs: add seq_file_path() helper Turn seq_path(..., &file->f_path, ...); into seq_file_path(..., file, ...); Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-06-23 18:01:07 -04:00
Miklos Szeredi	9bf39ab2ad	vfs: add file_path() helper Turn d_path(&file->f_path, ...); into file_path(file, ...); Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-06-23 18:00:05 -04:00
Mikulas Patocka	e262f34741	dm stats: add support for request-based DM devices This makes it possible to use dm stats with DM multipath. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-17 12:40:41 -04:00
Mikulas Patocka	dfcfac3e4c	dm stats: collect and report histogram of IO latencies Add an option to dm statistics to collect and report a histogram of IO latencies. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-17 12:40:40 -04:00
Mikulas Patocka	c96aec344d	dm stats: support precise timestamps Make it possible to use precise timestamps with nanosecond granularity in dm statistics. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-17 12:40:40 -04:00
Mikulas Patocka	dd4c1b7d0c	dm stats: fix divide by zero if 'number_of_areas' arg is zero If the number_of_areas argument was zero the kernel would crash on div-by-zero. Add better input validation. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # v3.12+	2015-06-17 12:40:39 -04:00
Mike Snitzer	bccab6a01a	dm cache: switch the "default" cache replacement policy from mq to smq The Stochastic multiqueue (SMQ) policy (vs MQ) offers the promise of less memory utilization, improved performance and increased adaptability in the face of changing workloads. SMQ also does not have any cumbersome tuning knobs. Users may switch from "mq" to "smq" simply by appropriately reloading a DM table that is using the cache target. Doing so will cause all of the mq policy's hints to be dropped. Also, performance of the cache may degrade slightly until smq recalculates the origin device's hotspots that should be cached. In the future the "mq" policy will just silently make use of "smq" and the mq code will be removed. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2015-06-17 12:40:38 -04:00
Joe Thornber	6096d91af0	dm space map metadata: fix occasional leak of a metadata block on resize The metadata space map has a simplified 'bootstrap' mode that is operational when extending the space maps. Whilst in this mode it's possible for some refcount decrement operations to become queued (eg, as a result of shadowing one of the bitmap indexes). These decrements were not being applied when switching out of bootstrap mode. The effect of this bug was the leaking of a 4k metadata block. This is detected by the latest version of thin_check as a non fatal error. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-06-17 10:09:23 -04:00
Firo Yang	4e02361232	md: fix a build warning Warning like this: drivers/md/md.c: In function "update_array_info": drivers/md/md.c:6394:26: warning: logical not is only applied to the left hand side of comparison [-Wlogical-not-parentheses] !mddev->persistent != info->not_persistent\|\| Fix it as Neil Brown said: mddev->persistent != !info->not_persistent \|\| Signed-off-by: Firo Yang <firogm@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-06-17 10:00:38 +10:00
Shaohua Li	713bc5c2de	md/raid5: ignore released_stripes check conf->released_stripes list isn't always related to where there are free stripes pending. Active stripes can be in the list too. And even free stripes were active very recently. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-06-17 10:00:33 +10:00
Yuanhan Liu	e9e4c377e2	md/raid5: per hash value and exclusive wait_for_stripe I noticed heavy spin lock contention at get_active_stripe() with fsmark multiple thread write workloads. Here is how this hot contention comes from. We have limited stripes, and it's a multiple thread write workload. Hence, those stripes will be taken soon, which puts later processes to sleep for waiting free stripes. When enough stripes(>= 1/4 total stripes) are released, all process are woken, trying to get the lock. But there is one only being able to get this lock for each hash lock, making other processes spinning out there for acquiring the lock. Thus, it's effectiveless to wakeup all processes and let them battle for a lock that permits one to access only each time. Instead, we could make it be a exclusive wake up: wake up one process only. That avoids the heavy spin lock contention naturally. To do the exclusive wake up, we've to split wait_for_stripe into multiple wait queues, to make it per hash value, just like the hash lock. Here are some test results I have got with this patch applied(all test run 3 times): `fsmark.files_per_sec' ===================== next-20150317 this patch ------------------------- ------------------------- metric_value ±stddev metric_value ±stddev change testbox/benchmark/testcase-params ------------------------- ------------------------- -------- ------------------------------ 25.600 ±0.0 92.700 ±2.5 262.1% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose 25.600 ±0.0 77.800 ±0.6 203.9% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose 32.000 ±0.0 93.800 ±1.7 193.1% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose 32.000 ±0.0 81.233 ±1.7 153.9% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose 48.800 ±14.5 99.667 ±2.0 104.2% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose 6.400 ±0.0 12.800 ±0.0 100.0% ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose 63.133 ±8.2 82.800 ±0.7 31.2% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose 245.067 ±0.7 306.567 ±7.9 25.1% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-f2fs-4M-30G-fsyncBeforeClose 17.533 ±0.3 21.000 ±0.8 19.8% ivb44/fsmark/1x-1t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose 188.167 ±1.9 215.033 ±3.1 14.3% ivb44/fsmark/1x-1t-4BRD_12G-RAID5-btrfs-4M-30G-NoSync 254.500 ±1.8 290.733 ±2.4 14.2% ivb44/fsmark/1x-1t-9BRD_6G-RAID5-btrfs-4M-30G-NoSync `time.system_time' ===================== next-20150317 this patch ------------------------- ------------------------- metric_value ±stddev metric_value ±stddev change testbox/benchmark/testcase-params ------------------------- ------------------------- -------- ------------------------------ 7235.603 ±1.2 185.163 ±1.9 -97.4% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose 7666.883 ±2.9 202.750 ±1.0 -97.4% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose 14567.893 ±0.7 421.230 ±0.4 -97.1% ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose 3697.667 ±14.0 148.190 ±1.7 -96.0% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose 5572.867 ±3.8 310.717 ±1.4 -94.4% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose 5565.050 ±0.5 313.277 ±1.5 -94.4% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose 2420.707 ±17.1 171.043 ±2.7 -92.9% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose 3743.300 ±4.6 379.827 ±3.5 -89.9% ivb44/fsmark/1x-64t-3HDD-RAID5-ext4-4M-40G-fsyncBeforeClose 3308.687 ±6.3 363.050 ±2.0 -89.0% ivb44/fsmark/1x-64t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose Where, 1x: where 'x' means iterations or loop, corresponding to the 'L' option of fsmark 1t, 64t: where 't' means thread 4M: means the single file size, corresponding to the '-s' option of fsmark 40G, 30G, 120G: means the total test size 4BRD_12G: BRD is the ramdisk, where '4' means 4 ramdisk, and where '12G' means the size of one ramdisk. So, it would be 48G in total. And we made a raid on those ramdisk As you can see, though there are no much performance gain for hard disk workload, the system time is dropped heavily, up to 97%. And as expected, the performance increased a lot, up to 260%, for fast device(ram disk). v2: use bits instead of array to note down wait queue need to wake up. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-06-17 10:00:27 +10:00
Yuanhan Liu	b1b4648648	md/raid5: split wait_for_stripe and introduce wait_for_quiescent I noticed heavy spin lock contention at get_active_stripe(), introduced at being wake up stage, where a bunch of processes try to re-hold the spin lock again. After giving some thoughts on this issue, I found the lock could be relieved(and even avoided) if we turn the wait_for_stripe to per waitqueue for each lock hash and make the wake up exclusive: wake up one process each time, which avoids the lock contention naturally. Before go hacking with wait_for_stripe, I found it actually has 2 usages: for the array to enter or leave the quiescent state, and also to wait for an available stripe in each of the hash lists. So this patch splits the first usage off into a separate wait_queue, wait_for_quiescent, and the next patch will turn the second usage into one waitqueue for each hash value, and make it exclusive, to relieve the lock contention. v2: wake_up(wait_for_quiescent) when (active_stripes == 0) Commit log refactor suggestion from Neil. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-06-17 10:00:21 +10:00
Alexey Dobriyan	4c9309c0cc	md: convert to kstrto() Convert away from deprecated simple_strto() functions. Add "fit into sector_t" checks. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-06-17 10:00:06 +10:00
Kent Overstreet	c31df25f20	md/raid10: make sync_request_write() call bio_copy_data() Refactor sync_request_write() of md/raid10 to use bio_copy_data() instead of open coding bio_vec iterations. Cc: Christoph Hellwig <hch@infradead.org> Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: add more description in commit message] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <mlin@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de>	2015-06-17 09:59:57 +10:00
NeilBrown	ea358cd0d2	md: make sure MD_RECOVERY_DONE is clear before starting recovery/resync MD_RECOVERY_DONE is normally cleared by md_check_recovery after a resync etc finished. However it is possible for raid5_start_reshape to race and start a reshape before MD_RECOVERY_DONE is cleared. This can lean to multiple reshapes running at the same time, which isn't good. To make sure it is cleared before starting a reshape, and also clear it when reaping a thread, just to be safe. Signed-off-by: NeilBrown <neilb@suse.de>	2015-06-12 20:16:33 +10:00
NeilBrown	8e8e2518fc	md: Close race when setting 'action' to 'idle'. Checking ->sync_thread without holding the mddev_lock() isn't really safe, even after flushing the workqueue which ensures md_start_sync() has been run. While this code is waiting for the lock, md_check_recovery could reap the thread itself, and then start another thread (e.g. recovery might finish, then reshape starts). When this thread gets the lock md_start_sync() hasn't run so it doesn't get reaped, but MD_RECOVERY_RUNNING gets cleared. This allows two threads to start which leads to confusion. So don't both if MD_RECOVERY_RUNNING isn't set, but if it is do the flush and the test and the reap all under the mddev_lock to avoid any race with md_check_recovery. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: `6791875e2e` ("md: make reconfig_mutex optional for writes to md sysfs files.") Cc: stable@vger.kernel.org (v4.0+)	2015-06-12 20:16:26 +10:00
NeilBrown	c008f1d356	md: don't return 0 from array_state_store Returning zero from a 'store' function is bad. The return value should be either len length of the string or an error. So use 'len' if 'err' is zero. Fixes: `6791875e2e` ("md: make reconfig_mutex optional for writes to md sysfs files.") Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@vger.kernel (v4.0+)	2015-06-12 20:16:16 +10:00
Joe Thornber	b1f11aff04	dm thin metadata: fix a race when entering fail mode In dm_thin_find_block() the ->fail_io flag was checked outside the metadata device's root_lock, causing dm_thin_find_block() to race with the setting of this flag. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-11 17:13:06 -04:00
Mike Snitzer	fd467696e8	dm thin: fail messages with EOPNOTSUPP when pool cannot handle messages Use EOPNOTSUPP, rather than EINVAL, error code when user attempts to send the pool a message. Otherwise usespace is led to believe the message failed due to invalid argument. Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-11 17:13:05 -04:00
Joe Thornber	34fbcf6257	dm thin: range discard support Previously REQ_DISCARD bios have been split into block sized chunks before submission to the thin target. There are a couple of issues with this: - If the block size is small, a large discard request can get broken up into a great many bios which is both slow and causes a lot of memory pressure. - The thin pool block size and the discard granularity for the underlying data device need to be compatible if we want to passdown the discard. This patch relaxes the block size granularity for thin devices. It makes use of the recent range locking added to the bio_prison to quiesce a whole range of thin blocks before unmapping them. Once a thin range has been unmapped the discard can then be passed down to the data device for those sub ranges where the data blocks are no longer used (ie. they weren't shared in the first place). This patch also doesn't make any apologies about open-coding portions of block core as a means to supporting async discard completions in the near-term -- if/when late bio splitting lands it'll all get cleaned up. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-11 17:13:05 -04:00
Joe Thornber	6550f075f5	dm thin metadata: add dm_thin_remove_range() Removes a range of blocks from the btree. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-11 17:13:04 -04:00
Joe Thornber	a5d895a90b	dm thin metadata: add dm_thin_find_mapped_range() Retrieve the next run of contiguously mapped blocks. Useful for working out where to break up IO. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-11 17:13:03 -04:00
Joe Thornber	4ec331c3ea	dm btree: add dm_btree_remove_leaves() Removes a range of leaf values from the tree. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-11 17:13:03 -04:00
Pekka Enberg	0f24b79b52	dm stats: Use kvfree() in dm_kvfree() Use kvfree() instead of open-coding it. Signed-off-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-11 17:13:02 -04:00
Joe Thornber	fba10109a4	dm cache: age and write back cache entries even without active IO The policy tick() method is normally called from interrupt context. Both the mq and smq policies do some bottom half work for the tick method in their map functions. However if no IO is going through the cache, then that bottom half work doesn't occur. With these policies this means recently hit entries do not age and do not get written back as early as we'd like. Fix this by introducing a new 'can_block' parameter to the tick() method. When this is set the bottom half work occurs immediately. 'can_block' is set when the tick method is called every second by the core target (not in interrupt context). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-11 17:13:01 -04:00
Mike Snitzer	b61d950962	dm cache: prefix all DMERR and DMINFO messages with cache device name Having the DM device name associated with the ERR or INFO message is very helpful. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-11 17:13:01 -04:00
Joe Thornber	028ae9f76f	dm cache: add fail io mode and needs_check flag If a cache metadata operation fails (e.g. transaction commit) the cache's metadata device will abort the current transaction, set a new needs_check flag, and the cache will transition to "read-only" mode. If aborting the transaction or setting the needs_check flag fails the cache will transition to "fail-io" mode. Once needs_check is set the cache device will not be allowed to activate. Activation requires write access to metadata. Future work is needed to add proper support for running the cache in read-only mode. Once in fail-io mode the cache will report a status of "Fail". Also, add commit() wrapper that will disallow commits if in read_only or fail mode. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-11 17:13:00 -04:00
Joe Thornber	88bf5184fa	dm cache: wake the worker thread every time we free a migration object When the cache is idle, writeback work was only being issued every second. With this change outstanding writebacks are streamed constantly. This offers a writeback performance improvement. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-11 17:13:00 -04:00
Joe Thornber	66a6363566	dm cache: add stochastic-multi-queue (smq) policy The stochastic-multi-queue (smq) policy addresses some of the problems with the current multiqueue (mq) policy. Memory usage ------------ The mq policy uses a lot of memory; 88 bytes per cache block on a 64 bit machine. SMQ uses 28bit indexes to implement it's data structures rather than pointers. It avoids storing an explicit hit count for each block. It has a 'hotspot' queue rather than a pre cache which uses a quarter of the entries (each hotspot block covers a larger area than a single cache block). All these mean smq uses ~25bytes per cache block. Still a lot of memory, but a substantial improvement nontheless. Level balancing --------------- MQ places entries in different levels of the multiqueue structures based on their hit count (~ln(hit count)). This means the bottom levels generally have the most entries, and the top ones have very few. Having unbalanced levels like this reduces the efficacy of the multiqueue. SMQ does not maintain a hit count, instead it swaps hit entries with the least recently used entry from the level above. The over all ordering being a side effect of this stochastic process. With this scheme we can decide how many entries occupy each multiqueue level, resulting in better promotion/demotion decisions. Adaptability ------------ The MQ policy maintains a hit count for each cache block. For a different block to get promoted to the cache it's hit count has to exceed the lowest currently in the cache. This means it can take a long time for the cache to adapt between varying IO patterns. Periodically degrading the hit counts could help with this, but I haven't found a nice general solution. SMQ doesn't maintain hit counts, so a lot of this problem just goes away. In addition it tracks performance of the hotspot queue, which is used to decide which blocks to promote. If the hotspot queue is performing badly then it starts moving entries more quickly between levels. This lets it adapt to new IO patterns very quickly. Performance ----------- In my tests SMQ shows substantially better performance than MQ. Once this matures a bit more I'm sure it'll become the default policy. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-11 17:12:59 -04:00
Tejun Heo	66114cad64	writeback: separate out include/linux/backing-dev-defs.h With the planned cgroup writeback support, backing-dev related declarations will be more widely used across block and cgroup; unfortunately, including backing-dev.h from include/linux/blkdev.h makes cyclic include dependency quite likely. This patch separates out backing-dev-defs.h which only has the essential definitions and updates blkdev.h to include it. c files which need access to more backing-dev details now include backing-dev.h directly. This takes backing-dev.h off the common include dependency chain making it a lot easier to use it across block and cgroup. v2: fs/fat build failure fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:34 -06:00
Tejun Heo	4452226ea2	writeback: move backing_dev_info->state into bdi_writeback Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback) and the role of the separation is unclear. For cgroup support for writeback IOs, a bdi will be updated to host multiple wb's where each wb serves writeback IOs of a different cgroup on the bdi. To achieve that, a wb should carry all states necessary for servicing writeback IOs for a cgroup independently. This patch moves bdi->state into wb. * enum bdi_state is renamed to wb_state and the prefix of all enums is changed from BDI_ to WB_. * Explicit zeroing of bdi->state is removed without adding zeoring of wb->state as the whole data structure is zeroed on init anyway. * As there's still only one bdi_writeback per backing_dev_info, all uses of bdi->state are mechanically replaced with bdi->wb.state introducing no behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: drbd-dev@lists.linbit.com Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:34 -06:00
Linus Torvalds	0f1e5b5d19	Quite a few fixes for DM's blk-mq support thanks to extra DM multipath testing from Junichi Nomura and Bart Van Assche. Also fix a casting bug in dm_merge_bvec() that could cause only a single page to be added to a bio (Joe identified this while testing dm-cache writeback). -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJVaLGPAAoJEMUj8QotnQNa1pcH/3TumkM0rvl1DK7DfWvjgfpp +h+whWn79lOUzan/AszRvRU4Pwr7symndnJzJ/tqQ76fm8x84obvKa0jFfB9rmlM sgw45alqB+AKcKAvocsuDIdzy2/K8FHBihccPWhe87AtJRlj57nTOJWO06gzExOp /AJ/0Vgv3pxUCL2M4XucpQOV8eUg+gpwjDMxuMEsPVzBVYJfuRvNKoERtd9LSCSV Y5iqdHCJiW0SWesZNrfthAAlReFzqb1c3mq55G0q0C+Z77Jf3TIPBhVfouM4NOKt wjQoPavzyAEiKfLFy6QFCjhSOXKCI3klqO7zcyzyygVue5UTHICYRJKn7TFtbfU= =xGAV -----END PGP SIGNATURE----- Merge tag 'dm-4.1-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device-mapper fixes from Mike Snitzer: "Quite a few fixes for DM's blk-mq support thanks to extra DM multipath testing from Junichi Nomura and Bart Van Assche. Also fix a casting bug in dm_merge_bvec() that could cause only a single page to be added to a bio (Joe identified this while testing dm-cache writeback)" * tag 'dm-4.1-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: fix casting bug in dm_merge_bvec() dm: fix reload failure of 0 path multipath mapping on blk-mq devices dm: fix false warning in free_rq_clone() for unmapped requests dm: requeue from blk-mq dm_mq_queue_rq() using BLK_MQ_RQ_QUEUE_BUSY dm mpath: fix leak of dm_mpath_io structure in blk-mq .queue_rq error path dm: fix NULL pointer when clone_and_map_rq returns !DM_MAPIO_REMAPPED dm: run queue on re-queue	2015-05-29 14:39:24 -07:00
Joe Thornber	40775257b9	dm cache: boost promotion of blocks that will be overwritten When considering whether to move a block to the cache we already give preferential treatment to discarded blocks, since they are cheap to promote (no read of the origin required since the data is junk). The same is true of blocks that are about to be completely overwritten, so we likewise boost their promotion chances. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:19:07 -04:00
Joe Thornber	651f5fa2a3	dm cache: defer whole cells Currently individual bios are deferred to the worker thread if they cannot be processed immediately (eg, a block is in the process of being moved to the fast device). This patch passes whole cells across to the worker. This saves reaquiring the cell, and also collects bios destined for the same block together, which allows them to be mapped with a single look up to the policy. This reduces the overhead of using dm-cache. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:19:06 -04:00
Joe Thornber	3cdf93f9d8	dm bio prison: add dm_cell_promote_or_release() Rather than always releasing the prisoners in a cell, the client may want to promote one of them to be the new holder. There is a race here though between releasing an empty cell, and other threads adding new inmates. So this function makes the decision with its lock held. This function can have two outcomes: i) An inmate is promoted to be the holder of the cell (return value of 0). ii) The cell has no inmate for promotion and is released (return value of 1). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:19:06 -04:00
Joe Thornber	451b9e0071	dm cache: pull out some bitset utility functions for reuse Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:19:05 -04:00
Joe Thornber	20f6814b94	dm cache: pass a new 'critical' flag to the policies when requesting writeback work We only allow non critical writeback if the origin is idle. It is up to the policy to decide what writeback work is critical. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:19:04 -04:00
Joe Thornber	066dbaa386	dm cache: track IO to the origin device using io_tracker Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:19:04 -04:00
Joe Thornber	77289d3207	dm cache: add io_tracker A little class that keeps track of the volume of io that is in flight, and the length of time that a device has been idle for. FIXME: rather than jiffes, may be best to use ktime_t (to support faster devices). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:19:03 -04:00
Joe Thornber	fb4100ae7f	dm cache: fix race when issuing a POLICY_REPLACE operation There is a race between a policy deciding to replace a cache entry, the core target writing back any dirty data from this block, and other IO threads doing IO to the same block. This sort of problem is avoided most of the time by the core target grabbing a bio prison cell before making the request to the policy. But for a demotion the core target doesn't know which block will be demoted, so can't do this in advance. Fix this demotion race by introducing a callback to the policy interface that allows the policy to grab the cell on behalf of the core target. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-05-29 14:19:03 -04:00
Milan Broz	54cea3f668	dm crypt: add comments to better describe crypto processing logic A crypto driver can process requests synchronously or asynchronously and can use an internal driver queue to backlog requests. Add some comments to clarify internal logic and completion return codes. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:19:02 -04:00
Lidong Zhong	ed63287dd6	dm raid1: keep issuing IO after leg failure Currently if there is a leg failure, the bio will be put into the hold list until userspace does a remove/replace on the leg. Doing so in a cluster config (clvmd) is problematic because there may be a temporary path failure that results in cluster raid1 remove/replace. Such recovery takes a long time due to a full resync. Update dm-raid1 to optionally ignore these failures so bios continue being issued without interrupton. To enable this feature userspace must pass "keep_log" when creating the dm-raid1 device. Signed-off-by: Lidong Zhong <lzhong@suse.com> Tested-by: Liuhua Wang <lwang@suse.com> Acked-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:19:02 -04:00
Geert Uytterhoeven	f4ad317aed	dm log writes: use ULL suffix for 64-bit constants On 32-bit: drivers/md/dm-log-writes.c: In function ‘log_super’: drivers/md/dm-log-writes.c:323: warning: integer constant is too large for ‘long’ type Add a ULL suffix to WRITE_LOG_MAGIC to fix this. Also add a ULL suffix to WRITE_LOG_VERSION as it's stored in a __le64 field. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:19:01 -04:00
Luis Henriques	e223e1de4f	dm stripe: drop useless exit point from dm_stripe_init() Signed-off-by: Luis Henriques <luis.henriques@canonical.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:19:01 -04:00
Heinz Mauelshagen	0cf4503174	dm raid: add support for the MD RAID0 personality Add dm-raid access to the MD RAID0 personality to enable single zone striping. The following changes enable that access: - add type definition to raid_types array - make bitmap creation conditonal in super_validate(), because bitmaps are not allowed in raid0 - set rdev->sectors to the data image size in super_validate() to allow the raid0 personality to calculate the MD array size properly - use mdddev(un)lock() functions instead of direct mutex_(un)lock() (wrapped in here because it's a trivial change) - enhance raid_status() to always report full sync for raid0 so that userspace checks for 100% sync will succeed and allow for resize (and takeover/reshape once added in future paches) - enhance raid_resume() to not load bitmap in case of raid0 - add merge function to avoid data corruption (seen with readahead) that resulted from bio payloads that grew too large. This problem did not occur with the other raid levels because it either did not apply without striping (raid1) or was avoided via stripe caching. - raise version to 1.7.0 because of the raid0 API change Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:19:00 -04:00
Heinz Mauelshagen	c76d53f43e	dm raid: a few cleanups - ensure maximum device limit in superblock - rename DMPF_* (print flags) to CTR_FLAG_* (constructor flags) and their respective struct raid_set member - use strcasecmp() in raid10_format_to_md_layout() as in the constructor Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:19:00 -04:00
Heinz Mauelshagen	0f4106b32f	dm raid: fixup documentation for discard support Remove comment above parse_raid_params() that claims "devices_handle_discard_safely" is a table line argument when it is actually is a module parameter. Also, backfill dm-raid target version 1.6.0 documentation. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:18:59 -04:00
Mike Snitzer	49f154c732	dm thin metadata: remove in-core 'read_only' flag Leverage the block manager's read_only flag instead of duplicating it; access with new dm_bm_is_read_only() method. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:18:59 -04:00
Mike Snitzer	f8ae75253e	dm thin: cleanup schedule_zero() to read more logically The overwrite has only ever about optimizing away the need to zero a block if the entire block was being overwritten. As such it is only relevant when zeroing is enabled. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com>	2015-05-29 14:18:58 -04:00
Mike Snitzer	8b908f8e94	dm thin: cleanup overwrite's endio restore to be centralized Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:18:58 -04:00
Mike Snitzer	0f20972f7b	dm: factor out a common cleanup_mapped_device() Introduce a single common method for cleaning up a DM device's mapped_device. No functional change, just eliminates duplication of delicate mapped_device cleanup code. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:18:58 -04:00
Mike Snitzer	2d76fff18f	dm: cleanup methods that requeue requests More often than not a request that is requeued _is_ mapped (meaning the clone request is allocated and clone->q is initialized). Rename dm_requeue_unmapped_original_request() to avoid potential confusion due to function name containing "unmapped". Also, remove dm_requeue_unmapped_request() since callers can easily call the dm_requeue_original_request() directly. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:18:57 -04:00
Mike Snitzer	cbc4e3c135	dm: do not allocate any mempools for blk-mq request-based DM Do not allocate the io_pool mempool for blk-mq request-based DM (DM_TYPE_MQ_REQUEST_BASED) in dm_alloc_rq_mempools(). Also refine __bind_mempools() to have more precise awareness of which mempools each type of DM device uses -- avoids mempool churn when reloading DM tables (particularly for DM_TYPE_REQUEST_BASED). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:18:57 -04:00
Mike Snitzer	183f7802e7	Merge remote-tracking branch 'jens/for-4.2/core' into dm-4.2	2015-05-29 14:17:16 -04:00
Joe Thornber	1c220c69ce	dm: fix casting bug in dm_merge_bvec() dm_merge_bvec() was originally added in f6fccb ("dm: introduce merge_bvec_fn"). In that commit a value in sectors is converted to bytes using << 9, and then assigned to an int. This code made assumptions about the value of BIO_MAX_SECTORS. A later commit 148e51 ("dm: improve documentation and code clarity in dm_merge_bvec") was meant to have no functional change but it removed the use of BIO_MAX_SECTORS in favor of using queue_max_sectors(). At this point the cast from sector_t to int resulted in a zero value. The fallout being dm_merge_bvec() would only allow a single page to be added to a bio. This interim fix is minimal for the benefit of stable@ because the more comprehensive cleanup of passing a sector_t to all DM targets' merge function will impact quite a few DM targets. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.19+	2015-05-29 13:41:16 -04:00
Junichi Nomura	15b94a6904	dm: fix reload failure of 0 path multipath mapping on blk-mq devices dm-multipath accepts 0 path mapping. # echo '0 2097152 multipath 0 0 0 0' \| dmsetup create newdev Such a mapping can be used to release underlying devices while still holding requests in its queue until working paths come back. However, once the multipath device is created over blk-mq devices, it rejects reloading of 0 path mapping: # echo '0 2097152 multipath 0 0 1 1 queue-length 0 1 1 /dev/sda 1' \ \| dmsetup create mpath1 # echo '0 2097152 multipath 0 0 0 0' \| dmsetup load mpath1 device-mapper: reload ioctl on mpath1 failed: Invalid argument Command failed With following kernel message: device-mapper: ioctl: can't change device type after initial table load. DM tries to inherit the current table type using dm_table_set_type() but it doesn't work as expected because of unnecessary check about whether the target type is hybrid or not. Hybrid type is for targets that work as either request-based or bio-based and not required for blk-mq or non blk-mq checking. Fixes: `65803c2059` ("dm table: train hybrid target type detection to select blk-mq if appropriate") Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 13:41:16 -04:00
Linus Torvalds	c492e2d464	Assorted fixes for new RAID5 stripe-batching functionality. Unfortunately this functionality was merged a little prematurely. The necessary testing and code review is now complete (or as complete as it can be) and to code passes a variety of tests and looks quite sensible. Also a fix for some recent locking changes - a race was introduced which causes a reshape request to sometimes fail. No data safety issues. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIVAwUAVWf/sDnsnt1WYoG5AQJ8hQ/+KIGUijacXXUXBE4QuO1DMTkltV61bk6E TJQ6fTuMvXeOuyGm+BoSFTJrOJiP6/PVxl4jnAkjLlvAK/JVKekG0PXv2flmD9EJ udK/g8d2k+4L2O0uiGdGSfOQaEaQ4OQvNmQOP9GF/FXNdyYfZbSJnxG+kzWnStGZ 3LNEMoDok9TiUDVSJ3PgibnUHYr3zNJFjBGszfRW0HqXBRWM5TI6HQ0bWwrm61mQ sIOvFeS7CVOBQWW7zkY3uvz/g7dpuPlXqmDOomF+prKlU320SrpSDDBD2Qg56rXh 8YGAzLPV8R6xB5hjGFnoHtvxF/f5Fntb3WbC5az0zv+q/phDYA9Nd2UN5APemyGB PJuxW4Ojq2DWIAvmf0HQEkvjJlqeugCgIQXJJ8yvIaBXJJjit1jMSEXjolM4vlLh h6Su/hwoyTi9NxdYpFeR6JuyHjzTyrjyBkbW8y12wVQjmDncBdKtieYZX4TvPxVz n7Qrk2bpFhR/icP6eYWCvt6iwU1e+5lXNb/18AYm9bJe5BE5/N1X0azrxbdZT4cl 1DvQw2HAMBGp+nSr+R1lqO4yX+busBZUTYsaGvH4T7Ubs+UjwgTE3tPoevj6w829 0/7r/UPfSn0XFbd5rrPY+bOBsAOIMDG5g3mj7K7+38sVeX9VOVN4sGftS5dWTr9e RQBTZAK0+qI= =Y0Vm -----END PGP SIGNATURE----- Merge tag 'md/4.1-rc5-fixes' of git://neil.brown.name/md Pull m,ore md bugfixes gfrom Neil Brown: "Assorted fixes for new RAID5 stripe-batching functionality. Unfortunately this functionality was merged a little prematurely. The necessary testing and code review is now complete (or as complete as it can be) and to code passes a variety of tests and looks quite sensible. Also a fix for some recent locking changes - a race was introduced which causes a reshape request to sometimes fail. No data safety issues" * tag 'md/4.1-rc5-fixes' of git://neil.brown.name/md: md: fix race when unfreezing sync_action md/raid5: break stripe-batches when the array has failed. md/raid5: call break_stripe_batch_list from handle_stripe_clean_event md/raid5: be more selective about distributing flags across batch. md/raid5: add handle_flags arg to break_stripe_batch_list. md/raid5: duplicate some more handle_stripe_clean_event code in break_stripe_batch_list md/raid5: remove condition test from check_break_stripe_batch_list. md/raid5: Ensure a batch member is not handled prematurely. md/raid5: close race between STRIPE_BIT_DELAY and batching. md/raid5: ensure whole batch is delayed for all required bitmap updates.	2015-05-29 10:35:21 -07:00
Mike Snitzer	e5d8de32cc	dm: fix false warning in free_rq_clone() for unmapped requests When stacking request-based dm device on non blk-mq device and device-mapper target could not map the request (error target is used, multipath target with all paths down, etc), the WARN_ON_ONCE() in free_rq_clone() will trigger when it shouldn't. The warning was added by commit `aa6df8d` ("dm: fix free_rq_clone() NULL pointer when requeueing unmapped request"). But free_rq_clone() with clone->q == NULL is valid usage for the case where dm_kill_unmapped_request() initiates request cleanup. Fix this false warning by just removing the WARN_ON -- it only generated false positives and was never useful in catching the intended case (completing clone request not being mapped e.g. clone->q being NULL). Fixes: `aa6df8d` ("dm: fix free_rq_clone() NULL pointer when requeueing unmapped request") Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Reported-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 11:07:36 -04:00
NeilBrown	56ccc1125b	md: fix race when unfreezing sync_action A recent change removed the need for locking around writing to "sync_action" (and various other places), but introduced a subtle race. When e.g. setting 'reshape' on a 'frozen' array, the 'frozen' flag is cleared before 'reshape' is set, so the md thread can get in and start trying recovery - which isn't wanted. So instead of clearing MD_RECOVERY_FROZEN for any command except 'frozen', only clear it when each specific command is parsed. This allows the handling of 'reshape' to clear the bit while a lock is held. Also remove some places where we set MD_RECOVERY_NEEDED, as it is always set on non-error exit of the function. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: `6791875e2e` ("md: make reconfig_mutex optional for writes to md sysfs files.")	2015-05-28 18:04:45 +10:00
NeilBrown	626f2092c8	md/raid5: break stripe-batches when the array has failed. Once the array has too much failure, we need to break stripe-batches up so they can all be dealt with. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:48:59 +10:00
NeilBrown	787b76fa37	md/raid5: call break_stripe_batch_list from handle_stripe_clean_event Now that the code in break_stripe_batch_list() is nearly identical to the end of handle_stripe_clean_event, replace the later with a function call. The only remaining difference of any interest is the masking that is applieds to dev[i].flags copied from head_sh. R5_WriteError certainly isn't wanted as it is set per-stripe, not per-patch. R5_Overlap isn't wanted as it is explicitly handled. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:47:02 +10:00
NeilBrown	1b956f7a8f	md/raid5: be more selective about distributing flags across batch. When a batch of stripes is broken up, we keep some of the flags that were per-stripe, and copy other flags from the head to all others. This only happens while a stripe is being handled, so many of the flags are irrelevant. The "SYNC_FLAGS" (which I've renamed to make it clear there are several) and STRIPE_DEGRADED are set per-stripe and so need to be preserved. STRIPE_INSYNC is the only flag that is set on the head that needs to be propagated to all others. For safety, add a WARN_ON if others are set, except: STRIPE_HANDLE - this is safe and per-stripe and we are going to set in several cases anyway STRIPE_INSYNC STRIPE_IO_STARTED - this is just a hint and doesn't hurt. STRIPE_ON_PLUG_LIST STRIPE_ON_RELEASE_LIST - It is a point pointless for a batched stripe to be on one of these lists, but it can happen as can be safely ignored. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:40:01 +10:00
NeilBrown	3960ce7961	md/raid5: add handle_flags arg to break_stripe_batch_list. When we break a stripe_batch_list we sometimes want to set STRIPE_HANDLE on the individual stripes, and sometimes not. So pass a 'handle_flags' arg. If it is zero, always set STRIPE_HANDLE (on non-head stripes). If not zero, only set it if any of the given flags are present. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:39:30 +10:00
NeilBrown	fb642b92c2	md/raid5: duplicate some more handle_stripe_clean_event code in break_stripe_batch_list break_stripe_batch list didn't clear head_sh->batch_head. This was probably a bug. Also clear all R5_Overlap flags and if any were cleared, wake up 'wait_for_overlap'. This isn't always necessary but the worst effect is a little extra checking for code that is waiting on wait_for_overlap. Also, don't use wake_up_nr() because that does the wrong thing if 'nr' is zero, and it number of flags cleared doesn't strongly correlate with the number of threads to wake. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:36:25 +10:00
NeilBrown	4e3d62ff49	md/raid5: remove condition test from check_break_stripe_batch_list. handle_stripe_clean_event() contains a chunk of code very similar to check_break_stripe_batch_list(). If we make the latter more like the former, we can end up with just one copy of this code. This first step removed the condition (and the 'check_') part of the name. This has the added advantage of making it clear what check is being performed at the point where the function is called. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:36:06 +10:00
NeilBrown	b15a9dbdbf	md/raid5: Ensure a batch member is not handled prematurely. If a stripe is a member of a batch, but not the head, it must not be handled separately from the rest of the batch. 'clear_batch_ready()' handles this requirement to some extent but not completely. If a member is passed to handle_stripe() a second time it returns '0' indicating the stripe can be handled, which is wrong. So add an extra test. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:35:47 +10:00
NeilBrown	d0852df543	md/raid5: close race between STRIPE_BIT_DELAY and batching. When we add a write to a stripe we need to make sure the bitmap bit is set. While doing that the stripe is not locked so it could be added to a batch after which further changes to STRIPE_BIT_DELAY and ->bm_seq are ineffective. So we need to hold off adding to a stripe until bitmap_startwrite has completed at least once, and we need to avoid further changes to STRIPE_BIT_DELAY once the stripe has been added to a batch. If a bitmap_startwrite() completes after the stripe was added to a batch, it will not have set the bit, only incremented a counter, so no extra delay of the stripe is needed. Reported-by: Shaohua Li <shli@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:34:40 +10:00
NeilBrown	2b6b245742	md/raid5: ensure whole batch is delayed for all required bitmap updates. When we add a stripe to a batch, we need to be sure that head stripe will wait for the bitmap update required for the new stripe. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-28 11:29:14 +10:00
Mike Snitzer	45714fbed4	dm: requeue from blk-mq dm_mq_queue_rq() using BLK_MQ_RQ_QUEUE_BUSY Use BLK_MQ_RQ_QUEUE_BUSY to requeue a blk-mq request directly from the DM blk-mq device's .queue_rq. This cleans up the previous convoluted handling of request requeueing that would return BLK_MQ_RQ_QUEUE_OK (even though it wasn't) and then run blk_mq_requeue_request() followed by blk_mq_kick_requeue_list(). Also, document that DM blk-mq ontop of old request_fn devices cannot fail in clone_rq() since the clone request is preallocated as part of the pdu. Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-27 17:37:23 -04:00
Mike Snitzer	4c6dd53dd3	dm mpath: fix leak of dm_mpath_io structure in blk-mq .queue_rq error path Otherwise kmemleak reported: unreferenced object 0xffff88009b14e2b0 (size 16): comm "fio", pid 4274, jiffies 4294978034 (age 1253.210s) hex dump (first 16 bytes): 40 12 f3 99 01 88 ff ff 00 10 00 00 00 00 00 00 @............... backtrace: [<ffffffff81600029>] kmemleak_alloc+0x49/0xb0 [<ffffffff811679a8>] kmem_cache_alloc+0xf8/0x160 [<ffffffff8111c950>] mempool_alloc_slab+0x10/0x20 [<ffffffff8111cb37>] mempool_alloc+0x57/0x150 [<ffffffffa04d2b61>] __multipath_map.isra.17+0xe1/0x220 [dm_multipath] [<ffffffffa04d2cb5>] multipath_clone_and_map+0x15/0x20 [dm_multipath] [<ffffffffa02889b5>] map_request.isra.39+0xd5/0x220 [dm_mod] [<ffffffffa028b0e4>] dm_mq_queue_rq+0x134/0x240 [dm_mod] [<ffffffff812cccb5>] __blk_mq_run_hw_queue+0x1d5/0x380 [<ffffffff812ccaa5>] blk_mq_run_hw_queue+0xc5/0x100 [<ffffffff812ce350>] blk_sq_make_request+0x240/0x300 [<ffffffff812c0f30>] generic_make_request+0xc0/0x110 [<ffffffff812c0ff2>] submit_bio+0x72/0x150 [<ffffffff811c07cb>] do_blockdev_direct_IO+0x1f3b/0x2da0 [<ffffffff811c166e>] __blockdev_direct_IO+0x3e/0x40 [<ffffffff8120aa1a>] ext4_direct_IO+0x1aa/0x390 Fixes: `e5863d9ad` ("dm: allocate requests in target when stacking on blk-mq devices") Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.0+	2015-05-27 17:37:22 -04:00
Junichi Nomura	3a1407559a	dm: fix NULL pointer when clone_and_map_rq returns !DM_MAPIO_REMAPPED When stacking request-based DM on blk_mq device, request cloning and remapping are done in a single call to target's clone_and_map_rq(). The clone is allocated and valid only if clone_and_map_rq() returns DM_MAPIO_REMAPPED. The "IS_ERR(clone)" check in map_request() does not cover all the !DM_MAPIO_REMAPPED cases that are possible (E.g. if underlying devices are not ready or unavailable, clone_and_map_rq() may return DM_MAPIO_REQUEUE without ever having established an ERR_PTR). Fix this by explicitly checking for a return that is not DM_MAPIO_REMAPPED in map_request(). Without this fix, DM core may call setup_clone() for a NULL clone and oops like this: BUG: unable to handle kernel NULL pointer dereference at 0000000000000068 IP: [<ffffffff81227525>] blk_rq_prep_clone+0x7d/0x137 ... CPU: 2 PID: 5793 Comm: kdmwork-253:3 Not tainted 4.0.0-nm #1 ... Call Trace: [<ffffffffa01d1c09>] map_tio_request+0xa9/0x258 [dm_mod] [<ffffffff81071de9>] kthread_worker_fn+0xfd/0x150 [<ffffffff81071cec>] ? kthread_parkme+0x24/0x24 [<ffffffff81071cec>] ? kthread_parkme+0x24/0x24 [<ffffffff81071fdd>] kthread+0xe6/0xee [<ffffffff81093a59>] ? put_lock_stats+0xe/0x20 [<ffffffff81071ef7>] ? __init_kthread_worker+0x5b/0x5b [<ffffffff814c2d98>] ret_from_fork+0x58/0x90 [<ffffffff81071ef7>] ? __init_kthread_worker+0x5b/0x5b Fixes: `e5863d9ad` ("dm: allocate requests in target when stacking on blk-mq devices") Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.0+	2015-05-27 09:48:51 -04:00
Junichi Nomura	4ae9944d13	dm: run queue on re-queue Without kicking queue, requeued request may stay forever in the queue if there are no other I/O activities to the device. The original error had been in v2.6.39 with commit `7eaceaccab` ("block: remove per-queue plugging"), which replaced conditional plugging by periodic runqueue. Commit `9d1deb83d4` in v4.1-rc1 removed the periodic runqueue and the problem started to manifest. Fixes: `9d1deb83d4` ("dm: don't schedule delayed run of the queue if nothing to do") Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-26 09:57:36 -04:00
Linus Torvalds	a30ec4b347	md fixes for 4.1-rc4 - one serious RAID0 data corruption - caused by recent bugfix that wasn't reviewed properly. - one raid5 fix in new code (a couple more of those to come). - one little fix to stop static analysis complaining about silly rcu annotation. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIVAwUAVV6rmDnsnt1WYoG5AQICpQ//VleOWnZERlVQN1nYzGHqIyPUz+dV6x8W RfzccMzUWqaHpZY/BaTMqJxPWgt6lxqqtN5F3u+XMdM+L2Pl0NXoRTqB+FzpLucR d0p76FwbW2IWnycxXUZtGjm8xT942KPdqXIP+LQRKaIhguV9P9th9WbPRvlXFfB6 pO1h1AdxVlsfn6qgBpe49pYoUuNmpwSTTJ8Gvb7YRsHXnmB/lsjQvVsWLiMqPwug +bR0gM07OvTfYenrY82ztu42+xAoVoeu/XhHmPPLEV3S/SA/9kVLgBcWjq2bBjw0 REzR2i+exEtCRe2yvetDX8320IKZcZSOE8BwBB2iUm8OVjplPxw83Bk/kB2+Zr8n VxAa9/vCZPLNCWyzOSi2V471NeFBys9PVkVIroHT5nQZGadkw0JYt1cEbcHSBKS/ NYmJCu1IM0o9DPzn8jW7sXjDcB0WE8VwFMJ09QSiEYQu2bAb7j2f1pcoweN2MeNF qOitGj/luygXg2TBvema9qPL533DUj+eSYyyO7OQBsfkfZSDM5di4bGoZBVw118Q kARaJ3IZffT8gsVpgSTCsviDv5QUcs0izVS1sMKB6/SDJjtLXBMERW+LkrmjBsnB 6+CMmwNZUIzugzBgSPBVozdNQQYjAXL7LNV6Q9YQhE0RK0+10fLLxmpovJR62bRf PWlg23+n4+E= =jXKM -----END PGP SIGNATURE----- Merge tag 'md/4.1-rc4-fixes' of git://neil.brown.name/md Pull md bugfixes from Neil Brown: "I have a few more raid5 bugfixes pending, but I want them to get a bit more review first. In the meantime: - one serious RAID0 data corruption - caused by recent bugfix that wasn't reviewed properly. - one raid5 fix in new code (a couple more of those to come). - one little fix to stop static analysis complaining about silly rcu annotation" * tag 'md/4.1-rc4-fixes' of git://neil.brown.name/md: md/bitmap: remove rcu annotation from pointer arithmetic. md/raid0: fix restore to sector variable in raid0_make_request raid5: fix broken async operation chain	2015-05-22 15:10:07 -07:00
Christoph Hellwig	5f1b670d0b	block, dm: don't copy bios for request clones Currently dm-multipath has to clone the bios for every request sent to the lower devices, which wastes cpu cycles and ties down memory. This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio to not complete bios attached to a request, which we set on clone requests similar to bios in a flush sequence. With this change I/O errors on a path failure only get propagated to dm-multipath, which can then either resubmit the I/O or complete the bios on the original request. I've done some basic testing of this on a Linux target with ALUA support, and it survives path failures during I/O nicely. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-05-22 08:58:57 -06:00
Mike Snitzer	326e1dbb57	block: remove management of bi_remaining when restoring original bi_end_io Commit `c4cf5261` ("bio: skip atomic inc/dec of ->bi_remaining for non-chains") regressed all existing callers that followed this pattern: 1) saving a bio's original bi_end_io 2) wiring up an intermediate bi_end_io 3) restoring the original bi_end_io from intermediate bi_end_io 4) calling bio_endio() to execute the restored original bi_end_io The regression was due to BIO_CHAIN only ever getting set if bio_inc_remaining() is called. For the above pattern it isn't set until step 3 above (step 2 would've needed to establish BIO_CHAIN). As such the first bio_endio(), in step 2 above, never decremented __bi_remaining before calling the intermediate bi_end_io -- leaving __bi_remaining with the value 1 instead of 0. When bio_inc_remaining() occurred during step 3 it brought it to a value of 2. When the second bio_endio() was called, in step 4 above, it should've called the original bi_end_io but it didn't because there was an extra reference that wasn't dropped (due to atomic operations being optimized away since BIO_CHAIN wasn't set upfront). Fix this issue by removing the __bi_remaining management complexity for all callers that use the above pattern -- bio_chain() is the only interface that _needs_ to be concerned with __bi_remaining. For the above pattern callers just expect the bi_end_io they set to get called! Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls that aren't associated with the bio_chain() interface. Also, the bio_inc_remaining() interface has been moved local to bio.c. Fixes: `c4cf5261` ("bio: skip atomic inc/dec of ->bi_remaining for non-chains") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-05-22 08:58:55 -06:00
NeilBrown	8532e34390	md/bitmap: remove rcu annotation from pointer arithmetic. Evaluating "&mddev->disks" is simple pointer arithmetic, so it does not need 'rcu' annotations - no dereferencing is happening. Also enhance the comment to explain that 'rdev' in that case is not actually a pointer to an rdev. Reported-by: Patrick Marlier <patrick.marlier@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-21 09:14:41 +10:00
Eric Work	a81157768a	md/raid0: fix restore to sector variable in raid0_make_request The variable "sector" in "raid0_make_request()" was improperly updated by a call to "sector_div()" which modifies its first argument in place. Commit `47d68979cc` restored this variable after the call for later re-use. Unfortunetly the restore was done after the referenced variable "bio" was advanced. This lead to the original value and the restored value being different. Here we move this line to the proper place. One observed side effect of this bug was discarding a file though unlinking would cause an unrelated file's contents to be discarded. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: `47d68979cc` ("md/raid0: fix bug with chunksize not a power of 2.") Cc: stable@vger.kernel.org (any that received above backport) URL: https://bugzilla.kernel.org/show_bug.cgi?id=98501	2015-05-21 09:14:25 +10:00
Shaohua Li	487696957e	raid5: fix broken async operation chain ops_run_reconstruct6() doesn't correctly chain asyn operations. The tx returned by async_gen_syndrome should be added as the dependent tx of next stripe. The issue is introduced by commit `59fc630b8b` RAID5: batch adjacent full stripe write Reported-and-tested-by: Maxime Ripard <maxime.ripard@free-electrons.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-21 09:14:20 +10:00
Linus Torvalds	c91aa67eed	A few fixes for md. Most of these are related to the new "batched stripe writeout", but there are a few others. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIVAwUAVVAgfznsnt1WYoG5AQKNsBAAw+yXVnRlLfzTPOhiIolvdmdmck6ghpV7 eTHTbDEVkDo+a+caU01mhkHtL2wEuxGqr4XHTASIM9Xjn2YLhH14HOtBPl6+IddR F58X+SN8/pK4tzc5P42AAbfGfqeJSBPSl49L94+Pd9BsagNYZOUx7c/WK5lKay6O ZQRvI3Bafg7VTzHCw931sN11yY/PS+45zpOentk5TlavRPcZfuVTh1QzSyAUvbtN sAFK2+ISflj8+TW0rLo44c3kkC7ARFKcq5lJt4WsQUntJTjiIROvqvaEP+IUZkGr y+yvCJbskJ6B03XrcuvdkFrp3Qe0yWtHCN6qKhc8AI/c/pcuz24xTYQ6V6LDyAy8 N2z6m1Kri97qxZpOONNh31WOCmVyFmwiz7+54soEeTDg3xgTb60UiZe8T+9ZcxoY sN8ksVSGGapgQUUGYcidbZla6uSWNjZq9hQgDJFUwG3HhvT3E/qiUT9kqAxfPsLE Vxw/KQ5GkW+7ldaPmGuv29hc3THjsiLY5Bx0kBlmgao3fcT61WbNcZPJOQfMxlT7 6qoJSiNoq612TaHj3X06FXhMGE/fwjwjOO+PpULxM47giHcKrVR/qlG4kREYa2vj UgV05rP/S+k3Mce4dHzvUuE57t0ADff0d/mZvhfzERjNdwTMdWgia89XLg7D06rk UGB/o2IwWug= =tapF -----END PGP SIGNATURE----- Merge tag 'md/4.1-rc3-fixes' of git://neil.brown.name/md Pull md bugfixes from Neil Brown: "A few fixes for md. Most of these are related to the new "batched stripe writeout", but there are a few others" * tag 'md/4.1-rc3-fixes' of git://neil.brown.name/md: md/raid5: fix handling of degraded stripes in batches. md/raid5: fix allocation of 'scribble' array. md/raid5: don't record new size if resize_stripes fails. md/raid5: avoid reading parity blocks for full-stripe write to degraded array md/raid5: more incorrect BUG_ON in handle_stripe_fill. md/raid5: new alloc_stripe() to allocate an initialize a stripe. md-raid0: conditional mddev->queue access to suit dm-raid	2015-05-11 10:33:31 -07:00
Linus Torvalds	5d5df5ee7c	Two additional fixes for changes introduced via DM during the 4.1 merge window: The first reverts a dm-crypt change that wasn't correct. The second fixes a device format regression that impacted userspace. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJVTRZOAAoJEMUj8QotnQNaW7EIAJlHn+S8czm1Cb4gWBn7kg+X vzH5NIzr/SpDX8o3R8NBdrB8rgqTm4jQZrptbmgLG+j9XoaQupuFyNCiaAw47v2G P/WYlodNwTkb3I48XjwCRo00MtR3cEJ8ywNzvEJUvgPkgMMIzhieHsVT9L8bZv3n XDs8JzZyF966U0BeCjF4oDAazUrpEvWf0h4C5L47g8C0UQI7aGwYKoSvZm3DAImP awbJbnqtQuoRcI0HISHrjYi1vghgnmJY6aSx3tYSJPTNRkFNqgap7eZrUacicnOH bUVL3snBVebK3JMJhJXgfGW/FeeP9juhEY08JNTOZ5wa6BNuru0GHeqKuI3arHY= =jlAN -----END PGP SIGNATURE----- Merge tag 'dm-4.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: "Two additional fixes for changes introduced via DM during the 4.1 merge window. The first reverts a dm-crypt change that wasn't correct. The second fixes a device format regression that impacted userspace" * tag 'dm-4.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: init: fix regression by supporting devices with major:minor:offset format Revert "dm crypt: fix deadlock when async crypto algorithm returns -EBUSY"	2015-05-08 20:38:21 -07:00
Linus Torvalds	1daac193f2	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "A collection of fixes since the merge window; - fix for a double elevator module release, from Chao Yu. Ancient bug. - the splice() MORE flag fix from Christophe Leroy. - a fix for NVMe, fixing a patch that went in in the merge window. From Keith. - two fixes for blk-mq CPU hotplug handling, from Ming Lei. - bdi vs blockdev lifetime fix from Neil Brown, fixing and oops in md. - two blk-mq fixes from Shaohua, fixing a race on queue stop and a bad merge issue with FUA writes. - division-by-zero fix for writeback from Tejun. - a block bounce page accounting fix, making sure we inc/dec after bouncing so that pre/post IO pages match up. From Wang YanQing" * 'for-linus' of git://git.kernel.dk/linux-block: splice: sendfile() at once fails for big files blk-mq: don't lose requests if a stopped queue restarts blk-mq: fix FUA request hang block: destroy bdi before blockdev is unregistered. block:bounce: fix call inc_\|dec_zone_page_state on different pages confuse value of NR_BOUNCE elevator: fix double release of elevator module writeback: use \|1 instead of +1 to protect against div by zero blk-mq: fix CPU hotplug handling blk-mq: fix race between timeout and CPU hotplug NVMe: Fix VPD B0 max sectors translation	2015-05-08 19:49:35 -07:00
NeilBrown	bb27051f9f	md/raid5: fix handling of degraded stripes in batches. There is no need for special handling of stripe-batches when the array is degraded. There may be if there is a failure in the batch, but STRIPE_DEGRADED does not imply an error. So don't set STRIPE_BATCH_ERR in ops_run_io just because the array is degraded. This actually causes a bug: the STRIPE_DEGRADED flag gets cleared in check_break_stripe_batch_list() and so the bitmap bit gets cleared when it shouldn't. So in check_break_stripe_batch_list(), split the batch up completely - again STRIPE_DEGRADED isn't meaningful. Also don't set STRIPE_BATCH_ERR when there is a write error to a replacement device. This simply removes the replacement device and requires no extra handling. Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-08 18:47:57 +10:00
NeilBrown	738a273806	md/raid5: fix allocation of 'scribble' array. As the new 'scribble' array is sized based on chunk size, we need to make sure the size matches the largest of 'old' and 'new' chunk sizes when the array is undergoing reshape. We also potentially need to resize it even when not resizing the stripe cache, as chunk size can change without changing number of devices. So move the 'resize' code into a separate function, and consider old and new sizes when allocating. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: `46d5b78562` ("raid5: use flex_array for scribble data")	2015-05-08 18:47:48 +10:00
NeilBrown	6e9eac2dce	md/raid5: don't record new size if resize_stripes fails. If any memory allocation in resize_stripes fails we will return -ENOMEM, but in some cases we update conf->pool_size anyway. This means that if we try again, the allocations will be assumed to be larger than they are, and badness results. So only update pool_size if there is no error. This bug was introduced in 2.6.17 and the patch is suitable for -stable. Fixes: `ad01c9e375` ("[PATCH] md: Allow stripes to be expanded in preparation for expanding an array") Cc: stable@vger.kernel.org (v2.6.17+) Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-08 18:47:35 +10:00
NeilBrown	10d82c5f0d	md/raid5: avoid reading parity blocks for full-stripe write to degraded array When performing a reconstruct write, we need to read all blocks that are not being over-written .. except the parity (P and Q) blocks. The code currently reads these (as they are not being over-written!) unnecessarily. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: `ea664c8245` ("md/raid5: need_this_block: tidy/fix last condition.")	2015-05-08 18:47:17 +10:00
NeilBrown	b0c783b323	md/raid5: more incorrect BUG_ON in handle_stripe_fill. It is not incorrect to call handle_stripe_fill() when a batch of full-stripe writes is active. It is, however, a BUG if fetch_block() then decides it needs to actually fetch anything. So move the 'BUG_ON' to where it belongs. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: `59fc630b8b` ("RAID5: batch adjacent full stripe write")	2015-05-08 18:46:52 +10:00
NeilBrown	f18c1a35f6	md/raid5: new alloc_stripe() to allocate an initialize a stripe. The new batch_lock and batch_list fields are being initialized in grow_one_stripe() but not in resize_stripes(). This causes a crash on resize. So separate the core initialization into a new function and call it from both allocation sites. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: `59fc630b8b` ("RAID5: batch adjacent full stripe write")	2015-05-08 18:40:01 +10:00
Heinz Mauelshagen	b6538fe329	md-raid0: conditional mddev->queue access to suit dm-raid This patch is a prerequisite for dm-raid "raid0" support to allow dm-raid to access the MD RAID0 personality doing unconditional accesses to mddev->queue, which is NULL in case of dm-raid stacked on top of MD. Most of the conditional mddev->queue accesses made it to upstream but this missing one, which prohibits md raid0 to set disk stack limits (being done in dm core in case of md underneath dm). Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Tested-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-05-08 18:39:40 +10:00
Jens Axboe	dac56212e8	bio: skip atomic inc/dec of ->bi_cnt for most use cases Struct bio has a reference count that controls when it can be freed. Most uses cases is allocating the bio, which then returns with a single reference to it, doing IO, and then dropping that single reference. We can remove this atomic_dec_and_test() in the completion path, if nobody else is holding a reference to the bio. If someone does call bio_get() on the bio, then we flag the bio as now having valid count and that we must properly honor the reference count when it's being put. Tested-by: Robert Elliott <elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-05-05 13:32:49 -06:00
Jens Axboe	c4cf5261f8	bio: skip atomic inc/dec of ->bi_remaining for non-chains Struct bio has an atomic ref count for chained bio's, and we use this to know when to end IO on the bio. However, most bio's are not chained, so we don't need to always introduce this atomic operation as part of ending IO. Add a helper to elevate the bi_remaining count, and flag the bio as now actually needing the decrement at end_io time. Rename the field to __bi_remaining to catch any current users of this doing the incrementing manually. For high IOPS workloads, this reduces the overhead of bio_endio() substantially. Tested-by: Robert Elliott <elliott@hp.com> Acked-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-05-05 13:32:47 -06:00
Rabin Vincent	c0403ec0bb	Revert "dm crypt: fix deadlock when async crypto algorithm returns -EBUSY" This reverts Linux 4.1-rc1 commit `0618764cb2`. The problem which that commit attempts to fix actually lies in the Freescale CAAM crypto driver not dm-crypt. dm-crypt uses CRYPTO_TFM_REQ_MAY_BACKLOG. This means the the crypto driver should internally backlog requests which arrive when the queue is full and process them later. Until the crypto hw's queue becomes full, the driver returns -EINPROGRESS. When the crypto hw's queue if full, the driver returns -EBUSY, and if CRYPTO_TFM_REQ_MAY_BACKLOG is set, is expected to backlog the request and process it when the hardware has queue space. At the point when the driver takes the request from the backlog and starts processing it, it calls the completion function with a status of -EINPROGRESS. The completion function is called (for a second time, in the case of backlogged requests) with a status/err of 0 when a request is done. Crypto drivers for hardware without hardware queueing use the helpers, crypto_init_queue(), crypto_enqueue_request(), crypto_dequeue_request() and crypto_get_backlog() helpers to implement this behaviour correctly, while others implement this behaviour without these helpers (ccp, for example). dm-crypt (before the patch that needs reverting) uses this API correctly. It queues up as many requests as the hw queues will allow (i.e. as long as it gets back -EINPROGRESS from the request function). Then, when it sees at least one backlogged request (gets -EBUSY), it waits till that backlogged request is handled (completion gets called with -EINPROGRESS), and then continues. The references to af_alg_wait_for_completion() and af_alg_complete() in that commit's commit message are irrelevant because those functions only handle one request at a time, unlink dm-crypt. The problem is that the Freescale CAAM driver, which that commit describes as having being tested with, fails to implement the backlogging behaviour correctly. In cam_jr_enqueue(), if the hardware queue is full, it simply returns -EBUSY without backlogging the request. What the observed deadlock was is not described in the commit message but it is obviously the wait_for_completion() in crypto_convert() where dm-crypto would wait for the completion being called with -EINPROGRESS in the case of backlogged requests. This completion will never be completed due to the bug in the CAAM driver. Commit `0618764cb2` incorrectly made dm-crypt wait for every request, even when the driver/hardware queues are not full, which means that dm-crypt will never see -EBUSY. This means that that commit will cause a performance regression on all crypto drivers which implement the API correctly. Revert it. Correct backlog handling should be implemented in the CAAM driver instead. Cc'ing stable purely because commit `0618764cb2` did. If for some reason a stable@ kernel did pick up commit `0618764cb2` it should get reverted. Signed-off-by: Rabin Vincent <rabin.vincent@axis.com> Reviewed-by: Horia Geanta <horia.geanta@freescale.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-05 12:16:43 -04:00
Mike Snitzer	aa6df8dd28	dm: fix free_rq_clone() NULL pointer when requeueing unmapped request Commit `022333427a` ("dm: optimize dm_mq_queue_rq to _not_ use kthread if using pure blk-mq") mistakenly removed free_rq_clone()'s clone->q check before testing clone->q->mq_ops. It was an oversight to discontinue that check for 1 of the 2 use-cases for free_rq_clone(): 1) free_rq_clone() called when an unmapped original request is requeued 2) free_rq_clone() called in the request-based IO completion path The clone->q check made sense for case #1 but not for #2. However, we cannot just reinstate the check as it'd mask a serious bug in the IO completion case #2 -- no in-flight request should have an uninitialized request_queue (basic block layer refcounting _should_ ensure this). The NULL pointer seen for case #1 is detailed here: https://www.redhat.com/archives/dm-devel/2015-April/msg00160.html Fix this free_rq_clone() NULL pointer by simply checking if the mapped_device's type is DM_TYPE_MQ_REQUEST_BASED (clone's queue is blk-mq) rather than checking clone->q->mq_ops. This avoids the need to dereference clone->q, but a WARN_ON_ONCE is added to let us know if an uninitialized clone request is being completed. Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-30 10:25:21 -04:00
Christoph Hellwig	3e6180f0c8	dm: only initialize the request_queue once Commit `bfebd1cdb4` ("dm: add full blk-mq support to request-based DM") didn't properly account for the need to short-circuit re-initializing DM's blk-mq request_queue if it was already initialized. Otherwise, reloading a blk-mq request-based DM table (either manually or via multipathd) resulted in errors, see: https://www.redhat.com/archives/dm-devel/2015-April/msg00132.html Fix is to only initialize the request_queue on the initial table load (when the mapped_device type is assigned). This is better than having dm_init_request_based_blk_mq_queue() return early if the queue was already initialized because it elevates the constraint to a more meaningful location in DM core. As such the pre-existing early return in dm_init_request_based_queue() can now be removed. Fixes: `bfebd1cdb4` ("dm: add full blk-mq support to request-based DM") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-30 10:25:21 -04:00
NeilBrown	6cd18e711d	block: destroy bdi before blockdev is unregistered. Because of the peculiar way that md devices are created (automatically when the device node is opened), a new device can be created and registered immediately after the blk_unregister_region(disk_devt(disk), disk->minors); call in del_gendisk(). Therefore it is important that all visible artifacts of the previous device are removed before this call. In particular, the 'bdi'. Since: commit `c4db59d31e` Author: Christoph Hellwig <hch@lst.de> fs: don't reassign dirty inodes to default_backing_dev_info moved the device_unregister(bdi->dev); call from bdi_unregister() to bdi_destroy() it has been quite easy to lose a race and have a new (e.g.) "md127" be created after the blk_unregister_region() call and before bdi_destroy() is ultimately called by the final 'put_disk', which must come after del_gendisk(). The new device finds that the bdi name is already registered in sysfs and complains > [ 9627.630029] WARNING: CPU: 18 PID: 3330 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5a/0x70() > [ 9627.630032] sysfs: cannot create duplicate filename '/devices/virtual/bdi/9:127' We can fix this by moving the bdi_destroy() call out of blk_release_queue() (which can happen very late when a refcount reaches zero) and into blk_cleanup_queue() - which happens exactly when the md device driver calls it. Then it is only necessary for md to call blk_cleanup_queue() before del_gendisk(). As loop.c devices are also created on demand by opening the device node, we make the same change there. Fixes: `c4db59d31e` Reported-by: Azat Khuzhin <a3at.mail@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org (v4.0) Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-04-27 10:27:20 -06:00
Linus Torvalds	474095e46c	md updates for 4.1 Highlights: - "experimental" code for managing md/raid1 across a cluster using DLM. Code is not ready for general use and triggers a WARNING if used. However it is looking good and mostly done and having in mainline will help co-ordinate development. - RAID5/6 can now batch multiple (4K wide) stripe_heads so as to handle a full (chunk wide) stripe as a single unit. - RAID6 can now perform read-modify-write cycles which should help performance on larger arrays: 6 or more devices. - RAID5/6 stripe cache now grows and shrinks dynamically. The value set is used as a minimum. - Resync is now allowed to go a little faster than the 'mininum' when there is competing IO. How much faster depends on the speed of the devices, so the effective minimum should scale with device speed to some extent. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIVAwUAVTbIxDnsnt1WYoG5AQIgAA//Z9FlEpkHcJJ75WGjrXgGJPyNTfEZOkoz jnD8PpBY2Afp341vMatd0XKErdEAhuPQmJMAa+tbxht6pk77X3fSzXghGA8FEafg yazn5pfBt6NXmepV4vhl/+LNYuWRxxLSA9EDm7wEg+tiO0UEts+a0w2TSXzcT1w+ 30Yi1EjAlTaJ5yjlBHUOtTXWE43D6RKnUr6FMy2dRlnzlFyDlezRMChWo9v6OkQF YqJ20FcvmJdLHY/6Yif3jvpm7eQecMqdCZENTvW/mJI86zqf6E+ToCYS1VNjfDnK ud61iU9eshu4WtNWBG6KLuHBD0grO1NaEL7/S16w1KdNJMhYgiK8WussvIAJEesA 5SlETM7Y/1XFq8puwlAq2/tuPfhZ+TFxnAwce/C3hMTDcYAACnS/R6INFQXqGvy3 nX1NLogrCycX8oqxv3jTFKLVqIVwlkSlHcUGzIWjcfCF37StcXFKI5q862agyg2+ NNocFMuXhPPM1YcB9JJSo2nCsor4e9tTdVEZlFm2B3cc8LJ9BLWUMSoi1h7VK/1g P7psnPIjz7/cdI2TZTFjGTZ0Kvhx/NTYp41AZealDNxeGWUNM+5xGZnUF8QRBc/E 0dGHtEAah834BDQFvNnJtuuh/s+KwbvswjNP+njoBsHjIQIvngDABpOwpIkdqF6r diQ2gUPnHN0= =OHG6 -----END PGP SIGNATURE----- Merge tag 'md/4.1' of git://neil.brown.name/md Pull md updates from Neil Brown: "More updates that usual this time. A few have performance impacts which hould mostly be positive, but RAID5 (in particular) can be very work-load ensitive... We'll have to wait and see. Highlights: - "experimental" code for managing md/raid1 across a cluster using DLM. Code is not ready for general use and triggers a WARNING if used. However it is looking good and mostly done and having in mainline will help co-ordinate development. - RAID5/6 can now batch multiple (4K wide) stripe_heads so as to handle a full (chunk wide) stripe as a single unit. - RAID6 can now perform read-modify-write cycles which should help performance on larger arrays: 6 or more devices. - RAID5/6 stripe cache now grows and shrinks dynamically. The value set is used as a minimum. - Resync is now allowed to go a little faster than the 'mininum' when there is competing IO. How much faster depends on the speed of the devices, so the effective minimum should scale with device speed to some extent" * tag 'md/4.1' of git://neil.brown.name/md: (58 commits) md/raid5: don't do chunk aligned read on degraded array. md/raid5: allow the stripe_cache to grow and shrink. md/raid5: change ->inactive_blocked to a bit-flag. md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe md/raid5: pass gfp_t arg to grow_one_stripe() md/raid5: introduce configuration option rmw_level md/raid5: activate raid6 rmw feature md/raid6 algorithms: xor_syndrome() for SSE2 md/raid6 algorithms: xor_syndrome() for generic int md/raid6 algorithms: improve test program md/raid6 algorithms: delta syndrome functions raid5: handle expansion/resync case with stripe batching raid5: handle io error of batch list RAID5: batch adjacent full stripe write raid5: track overwrite disk count raid5: add a new flag to track if a stripe can be batched raid5: use flex_array for scribble data md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid md: allow resync to go faster when there is competing IO. md: remove 'go_faster' option from ->sync_request() ...	2015-04-24 09:28:01 -07:00
Eric Mei	9ffc8f7cb9	md/raid5: don't do chunk aligned read on degraded array. When array is degraded, read data landed on failed drives will result in reading rest of data in a stripe. So a single sequential read would result in same data being read twice. This patch is to avoid chunk aligned read for degraded array. The downside is to involve stripe cache which means associated CPU overhead and extra memory copy. Test Results: Following test are done on a enterprise storage node with Seagate 6T SAS drives and Xeon E5-2648L CPU (10 cores, 1.9Ghz), 10 disks MD RAID6 8+2, chunk size 128 KiB. I use FIO, using direct-io with various bs size, enough queue depth, tested sequential and 100% random read against 3 array config: 1) optimal, as baseline; 2) degraded; 3) degraded with this patch. Kernel version is 4.0-rc3. Each individual test I only did once so there might be some variations, but we just focus on big trend. Sequential Read: bs=(KiB) optimal(MiB/s) degraded(MiB/s) degraded-with-patch (MiB/s) 1024 1608 656 995 512 1624 710 956 256 1635 728 980 128 1636 771 983 64 1612 1119 1000 32 1580 1420 1004 16 1368 688 986 8 768 647 953 4 411 413 850 Random Read: bs=(KiB) optimal(IOPS) degraded(IOPS) degraded-with-patch (IOPS) 1024 163 160 156 512 274 273 272 256 426 428 424 128 576 592 591 64 726 724 726 32 849 848 837 16 900 970 971 8 927 940 929 4 948 940 955 Some notes: * In sequential + optimal, as bs size getting smaller, the FIO thread become CPU bound. * In sequential + degraded, there's big increase when bs is 64K and 32K, I don't have explanation. * In sequential + degraded-with-patch, the MD thread mostly become CPU bound. If you want to we can discuss specific data point in those data. But in general it seems with this patch, we have more predictable and in most cases significant better sequential read performance when array is degraded, and almost no noticeable impact on random read. Performance is a complicated thing, the patch works well for this particular configuration, but may not be universal. For example I imagine testing on all SSD array may have very different result. But I personally think in most cases IO bandwidth is more scarce resource than CPU. Signed-off-by: Eric Mei <eric.mei@seagate.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:43 +10:00
NeilBrown	edbe83ab4c	md/raid5: allow the stripe_cache to grow and shrink. The default setting of 256 stripe_heads is probably much too small for many configurations. So it is best to make it auto-configure. Shrinking the cache under memory pressure is easy. The only interesting part here is that we put a fairly high cost ('seeks') on shrinking the cache as the cost is greater than just having to read more data, it reduces parallelism. Growing the cache on demand needs to be done carefully. If we allow fast growth, that can upset memory balance as lots of dirty memory can quickly turn into lots of memory queued in the stripe_cache. It is important for the raid5 block device to appear congested to allow write-throttling to work. So we only add stripes slowly. We set a flag when an allocation fails because all stripes are in use, allocate at a convenient time when that flag is set, and don't allow it to be set again until at least one stripe_head has been released for re-use. This means that a spurt of requests will only cause one stripe_head to be allocated, but a steady stream of requests will slowly increase the cache size - until memory pressure puts it back again. It could take hours to reach a steady state. The value written to, and displayed in, stripe_cache_size is used as a minimum. The cache can grow above this and shrink back down to it. The actual size is not directly visible, though it can be deduced to some extent by watching stripe_cache_active. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:43 +10:00
NeilBrown	5423399a84	md/raid5: change ->inactive_blocked to a bit-flag. This allows us to easily add more (atomic) flags. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:43 +10:00
NeilBrown	486f0644c3	md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe Rather than adjusting max_nr_stripes whenever {grow,drop}_one_stripe() succeeds, do it inside the functions. Also choose the correct hash to handle next inside the functions. This removes duplication and will help with future new uses of {grow,drop}_one_stripe. This also fixes a minor bug where the "md/raid:%md: allocate XXkB" message always said "0kB". Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:42 +10:00
NeilBrown	a9683a795b	md/raid5: pass gfp_t arg to grow_one_stripe() This is needed for future improvement to stripe cache management. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:42 +10:00
Markus Stockhausen	d06f191f8e	md/raid5: introduce configuration option rmw_level Depending on the available coding we allow optimized rmw logic for write operations. To support easier testing this patch allows manual control of the rmw/rcw descision through the interface /sys/block/mdX/md/rmw_level. The configuration can handle three levels of control. rmw_level=0: Disable rmw for all RAID types. Hardware assisted P/Q calculation has no implementation path yet to factor in/out chunks of a syndrome. Enforcing this level can be benefical for slow CPUs with hardware syndrome support and fast SSDs. rmw_level=1: Estimate rmw IOs and rcw IOs. Execute rmw only if we will save IOs. This equals the "old" unpatched behaviour and will be the default. rmw_level=2: Execute rmw even if calculated IOs for rmw and rcw are equal. We might have higher CPU consumption because of calculating the parity twice but it can be benefical otherwise. E.g. RAID4 with fast dedicated parity disk/SSD. The option is implemented just to be forward-looking and will ONLY work with this patch! Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:42 +10:00
Markus Stockhausen	584acdd49c	md/raid5: activate raid6 rmw feature Glue it altogehter. The raid6 rmw path should work the same as the already existing raid5 logic. So emulate the prexor handling/flags and split functions as needed. 1) Enable xor_syndrome() in the async layer. 2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome at the start of a rmw run as we did it before for the single parity. 3) Take care of rmw run in ops_run_reconstruct6(). Again process only the changed pages to get syndrome back into sync. 4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw run. The lower layers will calculate start & end pages from that and call the xor_syndrome() correspondingly. 5) Adapt the several places where we ignored Q handling up to now. Performance numbers for a single E5630 system with a mix of 10 7200k desktop/server disks. 300 seconds random write with 8 threads onto a 3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4) bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0 skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0 4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s 8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s 16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s 32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s 64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s 128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s 256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s 512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s Signed-off-by: Markus Stockhausen <stockhausen@collogia.de> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:42 +10:00
shli@kernel.org	dabc4ec6ba	raid5: handle expansion/resync case with stripe batching expansion/resync can grab a stripe when the stripe is in batch list. Since all stripes in batch list must be in the same state, we can't allow some stripes run into expansion/resync. So we delay expansion/resync for stripe in batch list. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
shli@kernel.org	72ac733015	raid5: handle io error of batch list If io error happens in any stripe of a batch list, the batch list will be split, then normal process will run for the stripes in the list. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
shli@kernel.org	59fc630b8b	RAID5: batch adjacent full stripe write stripe cache is 4k size. Even adjacent full stripe writes are handled in 4k unit. Idealy we should use big size for adjacent full stripe writes. Bigger stripe cache size means less stripes runing in the state machine so can reduce cpu overhead. And also bigger size can cause bigger IO size dispatched to under layer disks. With below patch, we will automatically batch adjacent full stripe write together. Such stripes will be added to the batch list. Only the first stripe of the list will be put to handle_list and so run handle_stripe(). Some steps of handle_stripe() are extended to cover all stripes of the list, including ops_run_io, ops_run_biodrain and so on. With this patch, we have less stripes running in handle_stripe() and we send IO of whole stripe list together to increase IO size. Stripes added to a batch list have some limitations. A batch list can only include full stripe write and can't cross chunk boundary to make sure stripes have the same parity disks. Stripes in a batch list must be in the same state (no written, toread and so on). If a stripe is in a batch list, all new read/write to add_stripe_bio will be blocked to overlap conflict till the batch list is handled. The limitations will make sure stripes in a batch list be in exactly the same state in the life circly. I did test running 160k randwrite in a RAID5 array with 32k chunk size and 6 PCIe SSD. This patch improves around 30% performance and IO size to under layer disk is exactly 32k. I also run a 4k randwrite test in the same array to make sure the performance isn't changed with the patch. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
shli@kernel.org	7a87f43405	raid5: track overwrite disk count Track overwrite disk count, so we can know if a stripe is a full stripe write. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
shli@kernel.org	da41ba6597	raid5: add a new flag to track if a stripe can be batched A freshly new stripe with write request can be batched. Any time the stripe is handled or new read is queued, the flag will be cleared. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
shli@kernel.org	46d5b78562	raid5: use flex_array for scribble data Use flex_array for scribble data. Next patch will batch several stripes together, so scribble data should be able to cover several stripes, so this patch also allocates scribble data for stripes across a chunk. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
Heinz Mauelshagen	753f2856cd	md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid The patch makes 3 references to mddev->queue in the raid0 personality conditional in order to allow for it to be accessed from dm-raid. Mandatory, because md instances underneath dm-raid don't manage a request queue of their own which'd lead to oopses without the patch. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Tested-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:41 +10:00
NeilBrown	ac8fa4196d	md: allow resync to go faster when there is competing IO. When md notices non-sync IO happening while it is trying to resync (or reshape or recover) it slows down to the set minimum. The default minimum might have made sense many years ago but the drives have become faster. Changing the default to match the times isn't really a long term solution. This patch changes the code so that instead of waiting until the speed has dropped to the target, it just waits until pending requests have completed. This means that the delay inserted is a function of the speed of the devices. Testing shows that: - for some loads, the resync speed is unchanged. For those loads increasing the minimum doesn't change the speed either. So this is a good result. To increase resync speed under such loads we would probably need to increase the resync window size. - for other loads, resync speed does increase to a reasonable fraction (e.g. 20%) of maximum possible, and throughput of the load only drops a little bit (e.g. 10%) - for other loads, throughput of the non-sync load drops quite a bit more. These seem to be latency-sensitive loads. So it isn't a perfect solution, but it is mostly an improvement. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:40 +10:00
NeilBrown	09314799e4	md: remove 'go_faster' option from ->sync_request() This option is not well justified and testing suggests that it hardly ever makes any difference. The comment suggests there might be a need to wait for non-resync activity indicated by ->nr_waiting, however raise_barrier() already waits for all of that. So just remove it to simplify reasoning about speed limiting. This allows us to remove a 'FIXME' comment from raid5.c as that never used the flag. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:40 +10:00
NeilBrown	50c37b136a	md: don't require sync_min to be a multiple of chunk_size. There is really no need for sync_min to be a multiple of chunk_size, and values read from here often aren't. That means you cannot read a value and expect to be able to write it back later. So remove the chunk_size check, and round down to a multiple of 4K, to be sure everything works with 4K-sector devices. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:40 +10:00
NeilBrown	d51e4fe6d6	Merge branch 'cluster' into for-next	2015-04-22 08:00:20 +10:00
Goldwyn Rodrigues	97f6cd39da	md-cluster: re-add capabilities When "re-add" is writted to /sys/block/mdXX/md/dev-YYY/state, the clustered md: 1. Sends RE_ADD message with the desc_nr. Nodes receiving the message clear the Faulty bit in their respective rdev->flags. 2. The node initiating re-add, gathers the bitmaps of all nodes and copies them into the local bitmap. It does not clear the bitmap from which it is copying. 3. Initiating node schedules a md recovery to sync the devices. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 07:59:39 +10:00
Goldwyn Rodrigues	a6da4ef85c	md: re-add a failed disk This adds the capability of re-adding a failed disk by writing "re-add" to /sys/block/mdXX/md/dev-YYY/state. This facilitates adding disks which have encountered a temporary error such as a network disconnection/hiccup in an iSCSI device, or a SAN cable disconnection which has been restored. In such a situation, you do not need to remove and re-add the device. Writing re-add to the failed device's state would add it again to the array and perform the recovery of only the blocks which were written after the device failed. This works for generic md, and is not related to clustering. However, this patch is to ease re-add operations listed above in clustering environments. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 07:59:39 +10:00
Goldwyn Rodrigues	88bcfef7be	md-cluster: remove capabilities This adds "remove" capabilities for the clustered environment. When a user initiates removal of a device from the array, a REMOVE message with disk number in the array is sent to all the nodes which kick the respective device in their own array. This facilitates the removal of failed devices. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 07:59:39 +10:00

... 3 4 5 6 7 ...

4065 Commits