OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Jes Sorensen	681ab46960	md/raid10: submit_bio_wait() returns 0 on success This was introduced with `9e882242c6` which changed the return value of submit_bio_wait() to return != 0 on error, but didn't update the caller accordingly. Fixes: `9e882242c6` ("block: Add submit_bio_wait(), remove from md") Cc: stable@vger.kernel.org (v3.10) Reported-by: Bill Kuzeja <William.Kuzeja@stratus.com> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-21 07:24:29 +11:00
NeilBrown	c2a06c38d9	Merge branch 'md-next' of git://github.com/goldwynr/linux into for-next md-cluster: A better way for METADATA_UPDATED processing The processing of METADATA_UPDATED message is too simple and prone to errors. Besides, it would not update the internal data structures as required. This set of patches reads the superblock from one of the device of the MD and checks for changes in the in-memory data structures. If there is a change, it performs the necessary actions to keep the internal data structures as it would be in the primary node. An example is if a devices turns faulty. The algorithm is: 1. The initiator node marks the device as faulty and updates the superblock 2. The initiator node sends METADATA_UPDATED with an advisory device number to the rest of the nodes. 3. The receiving node on receiving the METADATA_UPDATED message 3.1 Reads the superblock 3.2 Detects a device has failed by comparing with memory structure 3.3 Calls the necessary functions to record the failure and get the device out of the active array. 3.4 Acknowledges the message. The patch series also fixes adding the disk which was impacted because of the changes. Patches can also be found at https://github.com/goldwynr/linux branch md-next Changes since V2: - Fix status synchrnoization after --add and --re-add operations - Included Guoqing's patches on endian correctness, zeroing cmsg etc - Restructure add_new_disk() and cancel()	2015-10-14 07:09:52 +11:00
Goldwyn Rodrigues	c40f341f1e	md-cluster: Use a small window for resync Suspending the entire device for resync could take too long. Resync in small chunks. cluster's resync window (32M) is maintained in r1conf as cluster_sync_low and cluster_sync_high and processed in raid1's sync_request(). If the current resync is outside the cluster resync window: 1. Set the cluster_sync_low to curr_resync_completed. 2. Check if the sync will fit in the new window, if not issue a wait_barrier() and set cluster_sync_low to sector_nr. 3. Set cluster_sync_high to cluster_sync_low + resync_window. 4. Send a message to all nodes so they may add it in their suspension list. bitmap_cond_end_sync is modified to allow to force a sync inorder to get the curr_resync_completed uptodate with the sector passed. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-10-12 01:32:05 -05:00
Mikulas Patocka	a452744bcb	crash in md-raid1 and md-raid10 due to incorrect list manipulation The commit `55ce74d4bf` (md/raid1: ensure device failure recorded before write request returns) is causing crash in the LVM2 testsuite test shell/lvchange-raid.sh. For me the crash is 100% reproducible. The reason for the crash is that the newly added code in raid1d moves the list from conf->bio_end_io_list to tmp, then tests if tmp is non-empty and then incorrectly pops the bio from conf->bio_end_io_list (which is empty because the list was alrady moved). Raid-10 has a similar bug. Kernel Fault: Code=15 regs=000000006ccb8640 (Addr=0000000100000000) CPU: 3 PID: 1930 Comm: mdX_raid1 Not tainted 4.2.0-rc5-bisect+ #35 task: 000000006cc1f258 ti: 000000006ccb8000 task.ti: 000000006ccb8000 YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI PSW: 00001000000001001111111000001111 Not tainted r00-03 000000ff0804fe0f 000000001059d000 000000001059f818 000000007f16be38 r04-07 000000001059d000 000000007f16be08 0000000000200200 0000000000000001 r08-11 000000006ccb8260 000000007b7934d0 0000000000000001 0000000000000000 r12-15 000000004056f320 0000000000000000 0000000000013dd0 0000000000000000 r16-19 00000000f0d00ae0 0000000000000000 0000000000000000 0000000000000001 r20-23 000000000800000f 0000000042200390 0000000000000000 0000000000000000 r24-27 0000000000000001 000000000800000f 000000007f16be08 000000001059d000 r28-31 0000000100000000 000000006ccb8560 000000006ccb8640 0000000000000000 sr00-03 0000000000249800 0000000000000000 0000000000000000 0000000000249800 sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000 IASQ: 0000000000000000 0000000000000000 IAOQ: 000000001059f61c 000000001059f620 IIR: 0f8010c6 ISR: 0000000000000000 IOR: 0000000100000000 CPU: 3 CR30: 000000006ccb8000 CR31: 0000000000000000 ORIG_R28: 000000001059d000 IAOQ[0]: call_bio_endio+0x34/0x1a8 [raid1] IAOQ[1]: call_bio_endio+0x38/0x1a8 [raid1] RP(r2): raid_end_bio_io+0x88/0x168 [raid1] Backtrace: [<000000001059f818>] raid_end_bio_io+0x88/0x168 [raid1] [<00000000105a4f64>] raid1d+0x144/0x1640 [raid1] [<000000004017fd5c>] kthread+0x144/0x160 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: `55ce74d4bf` ("md/raid1: ensure device failure recorded before write request returns.") Fixes: `95af587e95` ("md/raid10: ensure device failure recorded before write request returns.") Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-09 08:33:46 +11:00
Julia Lawall	644df1a85f	md: drop null test before destroy functions Remove unneeded NULL test. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x; @@ -if (x != NULL) \(kmem_cache_destroy\\|mempool_destroy\\|dma_pool_destroy\)(x); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-02 17:23:44 +10:00
NeilBrown	e89c6fdf9e	Merge linux-block/for-4.3/core into md/for-linux There were a few conflicts that are fairly easy to resolve. Signed-off-by: NeilBrown <neilb@suse.com>	2015-09-05 11:08:32 +02:00
Linus Torvalds	1081230b74	Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: "This first core part of the block IO changes contains: - Cleanup of the bio IO error signaling from Christoph. We used to rely on the uptodate bit and passing around of an error, now we store the error in the bio itself. - Improvement of the above from myself, by shrinking the bio size down again to fit in two cachelines on x86-64. - Revert of the max_hw_sectors cap removal from a revision again, from Jeff Moyer. This caused performance regressions in various tests. Reinstate the limit, bump it to a more reasonable size instead. - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me. Most devices have huge trim limits, which can cause nasty latencies when deleting files. Enable the admin to configure the size down. We will look into having a more sane default instead of UINT_MAX sectors. - Improvement of the SGP gaps logic from Keith Busch. - Enable the block core to handle arbitrarily sized bios, which enables a nice simplification of bio_add_page() (which is an IO hot path). From Kent. - Improvements to the partition io stats accounting, making it faster. From Ming Lei. - Also from Ming Lei, a basic fixup for overflow of the sysfs pending file in blk-mq, as well as a fix for a blk-mq timeout race condition. - Ming Lin has been carrying Kents above mentioned patches forward for a while, and testing them. Ming also did a few fixes around that. - Sasha Levin found and fixed a use-after-free problem introduced by the bio->bi_error changes from Christoph. - Small blk cgroup cleanup from Viresh Kumar" * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits) blk: Fix bio_io_vec index when checking bvec gaps block: Replace SG_GAPS with new queue limits mask block: bump BLK_DEF_MAX_SECTORS to 2560 Revert "block: remove artifical max_hw_sectors cap" blk-mq: fix race between timeout and freeing request blk-mq: fix buffer overflow when reading sysfs file of 'pending' Documentation: update notes in biovecs about arbitrarily sized bios block: remove bio_get_nr_vecs() fs: use helper bio_add_page() instead of open coding on bi_io_vec block: kill merge_bvec_fn() completely md/raid5: get rid of bio_fits_rdev() md/raid5: split bio for chunk_aligned_read block: remove split code in blkdev_issue_{discard,write_same} btrfs: remove bio splitting and merge_bvec_fn() calls bcache: remove driver private bio splitting code block: simplify bio_add_page() block: make generic_make_request handle arbitrarily sized bios blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL) block: don't access bio->bi_error after bio_put() block: shrink struct bio down to 2 cache lines again ...	2015-09-02 13:10:25 -07:00
NeilBrown	95af587e95	md/raid10: ensure device failure recorded before write request returns. When a write to one of the legs of a RAID10 fails, the failure is recorded in the metadata of the other legs so that after a restart the data on the failed drive wont be trusted even if that drive seems to be working again (maybe a cable was unplugged). Currently there is no interlock between the write request completing and the metadata update. So it is possible that the write will complete, the app will confirm success in some way, and then the machine will crash before the metadata update completes. This is an extremely small hole for a racy to fit in, but it is theoretically possible and so should be closed. So: - set MD_CHANGE_PENDING when requesting a metadata update for a failed device, so we can know with certainty when it completes - queue requests that experienced an error on a new queue which is only processed after the metadata update completes - call raid_end_bio_io() on bios in that queue when the time comes. Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:43:45 +02:00
NeilBrown	02ec50265b	md/raid10: fix a few typos in comments Signed-off-by: NeilBrown <neilb@suse.com>	2015-08-31 19:32:09 +02:00
Kent Overstreet	8ae126660f	block: kill merge_bvec_fn() completely As generic_make_request() is now able to handle arbitrarily sized bios, it's no longer necessary for each individual block driver to define its own ->merge_bvec_fn() callback. Remove every invocation completely. Cc: Jens Axboe <axboe@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: drbd-user@lists.linbit.com Cc: Jiri Kosina <jkosina@suse.cz> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@kernel.org> Cc: ceph-devel@vger.kernel.org Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits) Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: also remove ->merge_bvec_fn() in dm-thin as well as dm-era-target, and resolve merge conflicts] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-08-13 12:31:57 -06:00
Jens Axboe	b7c44ed9d2	block: manipulate bio->bi_flags through helpers Some places use helpers now, others don't. We only have the 'is set' helper, add helpers for setting and clearing flags too. It was a bit of a mess of atomic vs non-atomic access. With BIO_UPTODATE gone, we don't have any risk of concurrent access to the flags. So relax the restriction and don't make any of them atomic. The flags that do have serialization issues (reffed and chained), we already handle those separately. Signed-off-by: Jens Axboe <axboe@fb.com>	2015-07-29 08:55:20 -06:00
Christoph Hellwig	4246a0b63b	block: add a bi_error field to struct bio Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-07-29 08:55:15 -06:00
NeilBrown	299b0685e3	md/raid10: always set reshape_safe when initializing reshape_position. 'reshape_position' tracks where in the reshape we have reached. 'reshape_safe' tracks where in the reshape we have safely recorded in the metadata. These are compared to determine when to update the metadata. So it is important that reshape_safe is initialised properly. Currently it isn't. When starting a reshape from the beginning it usually has the correct value by luck. But when reducing the number of devices in a RAID10, it has the wrong value and this leads to the metadata not being updated correctly. This can lead to corruption if the reshape is not allowed to complete. This patch is suitable for any -stable kernel which supports RAID10 reshape, which is 3.5 and later. Fixes: `3ea7daa5d7` ("md/raid10: add reshape support") Cc: stable@vger.kernel.org (v3.5+ please wait for -final to be out for 2 weeks) Signed-off-by: NeilBrown <neilb@suse.com>	2015-07-22 14:08:24 +10:00
Linus Torvalds	6aaf0da872	md updates for 4.2 A mixed bag - a few bug fixes - some performance improvement that decrease lock contention - some clean-up Nothing major. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJVi6weAAoJEDnsnt1WYoG50CsP/RqFbZicRSIvzXUURwP+yCP0 3YZuURj4IXC6Cy/HLX+bZoj1p/b+GIRsZ72fWFJrd2LheaAI6WojCCLlnmXUtI/Y LIppF8/A2hfCNbF9cILByvrbzfndeEGK8kvootBDpvD0jlYiGePPAMQY2zx0MAyb T4yJ/KiziLniP6x7vqZrQ6I1MRVjeanN6RWXktFtixMpNOKUJe3PiZbUz4VDIrHR DaiHCbMjvRIkUWgNY8HmijEt+c8AYia7muqLj359dy2xF1hlUIdCx+61cgFD1zd8 enKDH3xp+3B9BEgHe+AtxTAzpqSgU93tdhUjGcy/orA+yYjAAcA4ifngrzfE3VKb kwQgPh2JvUrubavrcto0hthS5RldrCpDXebOM4aEq+7lDHCwrZ39Qio5+1F7TLt5 A5E3Eb7dPRdp9T3LrluX8/f7bO/Wbmxvv/RwnSLTpnGQoBWIAqCpQ+e9ro446Gsx /phXv3tE78fKj88LgQY/mm8ICeCppmQGLrpmjk9bkaZzqFdzQoURVmPh8QPMuJB4 iMHpOOKLzrUlW/23rRxaIKwPuFyxlNuLAvyA3ezsymGiZ+SqSeFCEm1jN64EfMCI 39rpfZt2pcVVOZJ9YeuzZG9wpie96yGZgnVWlP3FPjqRpboXqmtHlYA6EMRtqDAy mjSiGDF2bxkT1/YcjELD =sXTI -----END PGP SIGNATURE----- Merge tag 'md/4.2' of git://neil.brown.name/md Pull md updates from Neil Brown: "A mixed bag - a few bug fixes - some performance improvement that decrease lock contention - some clean-up Nothing major" * tag 'md/4.2' of git://neil.brown.name/md: md: clear Blocked flag on failed devices when array is read-only. md: unlock mddev_lock on an error path. md: clear mddev->private when it has been freed. md: fix a build warning md/raid5: ignore released_stripes check md/raid5: per hash value and exclusive wait_for_stripe md/raid5: split wait_for_stripe and introduce wait_for_quiescent wait: introduce wait_event_exclusive_cmd md: convert to kstrto*() md/raid10: make sync_request_write() call bio_copy_data()	2015-06-29 11:10:56 -07:00
Linus Torvalds	e4bc13adfd	Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-block Pull cgroup writeback support from Jens Axboe: "This is the big pull request for adding cgroup writeback support. This code has been in development for a long time, and it has been simmering in for-next for a good chunk of this cycle too. This is one of those problems that has been talked about for at least half a decade, finally there's a solution and code to go with it. Also see last weeks writeup on LWN: http://lwn.net/Articles/648292/" * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits) writeback, blkio: add documentation for cgroup writeback support vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB writeback: do foreign inode detection iff cgroup writeback is enabled v9fs: fix error handling in v9fs_session_init() bdi: fix wrong error return value in cgwb_create() buffer: remove unusued 'ret' variable writeback: disassociate inodes from dying bdi_writebacks writeback: implement foreign cgroup inode bdi_writeback switching writeback: add lockdep annotation to inode_to_wb() writeback: use unlocked_inode_to_wb transaction in inode_congested() writeback: implement unlocked_inode_to_wb transaction and use it for stat updates writeback: implement [locked_]inode_to_wb_and_lock_list() writeback: implement foreign cgroup inode detection writeback: make writeback_control track the inode being written back writeback: relocate wb[_try]_get(), wb_put(), inode_{attach\|detach}_wb() mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use writeback: implement memcg writeback domain based throttling writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes writeback: implement memcg wb_domain writeback: update wb_over_bg_thresh() to use wb_domain aware operations ...	2015-06-25 16:00:17 -07:00
Kent Overstreet	c31df25f20	md/raid10: make sync_request_write() call bio_copy_data() Refactor sync_request_write() of md/raid10 to use bio_copy_data() instead of open coding bio_vec iterations. Cc: Christoph Hellwig <hch@infradead.org> Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: add more description in commit message] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <mlin@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de>	2015-06-17 09:59:57 +10:00
NeilBrown	ea358cd0d2	md: make sure MD_RECOVERY_DONE is clear before starting recovery/resync MD_RECOVERY_DONE is normally cleared by md_check_recovery after a resync etc finished. However it is possible for raid5_start_reshape to race and start a reshape before MD_RECOVERY_DONE is cleared. This can lean to multiple reshapes running at the same time, which isn't good. To make sure it is cleared before starting a reshape, and also clear it when reaping a thread, just to be safe. Signed-off-by: NeilBrown <neilb@suse.de>	2015-06-12 20:16:33 +10:00
Tejun Heo	4452226ea2	writeback: move backing_dev_info->state into bdi_writeback Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback) and the role of the separation is unclear. For cgroup support for writeback IOs, a bdi will be updated to host multiple wb's where each wb serves writeback IOs of a different cgroup on the bdi. To achieve that, a wb should carry all states necessary for servicing writeback IOs for a cgroup independently. This patch moves bdi->state into wb. * enum bdi_state is renamed to wb_state and the prefix of all enums is changed from BDI_ to WB_. * Explicit zeroing of bdi->state is removed without adding zeoring of wb->state as the whole data structure is zeroed on init anyway. * As there's still only one bdi_writeback per backing_dev_info, all uses of bdi->state are mechanically replaced with bdi->wb.state introducing no behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: drbd-dev@lists.linbit.com Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:34 -06:00
NeilBrown	09314799e4	md: remove 'go_faster' option from ->sync_request() This option is not well justified and testing suggests that it hardly ever makes any difference. The comment suggests there might be a need to wait for non-resync activity indicated by ->nr_waiting, however raise_barrier() already waits for all of that. So just remove it to simplify reasoning about speed limiting. This allows us to remove a 'FIXME' comment from raid5.c as that never used the flag. Signed-off-by: NeilBrown <neilb@suse.de>	2015-04-22 08:00:40 +10:00
NeilBrown	f04ebb0be7	md/raid10: round up to bdev_logical_block_size in narrow_write_error. RAID10 version of earlier fix for RAID1. We must never initiate IO with sizes less that logical_block_size. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-16 14:51:54 +11:00
NeilBrown	53a6ab4d3f	md/raid10: fix conversion from RAID0 to RAID10 A RAID0 array (like a LINEAR array) does not have a concept of 'size' being the amount of each device that is in use. Rather, as much of each device as is available is used. So the 'size' is set to 0 and ignored. RAID10 does have this concept and needs it to be set correctly. So when we convert RAID0 to RAID10 we must determine the 'size' (that being the size of the first 'strip_zone' in the RAID0), and set it correctly. Reported-and-tested-by: Xiao Ni <xni@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-12 14:09:57 +11:00
NeilBrown	afa0f557cb	md: rename ->stop to ->free Now that the ->stop function only frees the private data, rename is accordingly. Also pass in the private pointer as an arg rather than using mddev->private. This flexibility will be useful in level_store(). Finally, don't clear ->private. It doesn't make sense to clear it seeing that isn't what we free, and it is no longer necessary to clear ->private (it was some time ago before ->to_remove was introduced). Setting ->to_remove in ->free() is a bit of a wart, but not a big problem at the moment. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-04 08:35:52 +11:00
NeilBrown	5aa61f427e	md: split detach operation out from ->stop. Each md personality has a 'stop' operation which does two things: 1/ it finalizes some aspects of the array to ensure nothing is accessing the ->private data 2/ it frees the ->private data. All the steps in '1' can apply to all arrays and so can be performed in common code. This is useful as in the case where we change the personality which manages an array (in level_store()), it would be helpful to do step 1 early, and step 2 later. So split the 'step 1' functionality out into a new mddev_detach(). Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-04 08:35:52 +11:00
NeilBrown	64590f45dd	md: make merge_bvec_fn more robust in face of personality changes. There is no locking around calls to merge_bvec_fn(), so it is possible that calls which coincide with a level (or personality) change could go wrong. So create a central dispatch point for these functions and use rcu_read_lock(). If the array is suspended, reject any merge that can be rejected. If not, we know it is safe to call the function. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-04 08:35:52 +11:00
NeilBrown	5c675f83c6	md: make ->congested robust against personality changes. There is currently no locking around calls to the 'congested' bdi function. If called at an awkward time while an array is being converted from one level (or personality) to another, there is a tiny chance of running code in an unreferenced module etc. So add a 'congested' function to the md_personality operations structure, and call it with appropriate locking from a central 'mddev_congested'. When the array personality is changing the array will be 'suspended' so no IO is processed. If mddev_congested detects this, it simply reports that the array is congested, which is a safe guess. As mddev_suspend calls synchronize_rcu(), mddev_congested can avoid races by included the whole call inside an rcu_read_lock() region. This require that the congested functions for all subordinate devices can be run under rcu_lock. Fortunately this is the case. Signed-off-by: NeilBrown <neilb@suse.de>	2015-02-04 08:35:52 +11:00
NeilBrown	f72ffdd686	md: remove unwanted white space from md.c My editor shows much of this is RED. Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-14 13:08:29 +11:00
NeilBrown	c4796e215f	md/raid10: another memory leak due to reshape. Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-14 13:08:28 +11:00
NeilBrown	3fd83717e4	md: use set_bit/clear_bit instead of shift/mask for bi_flags changes. Using {set,clear}_bit is more consistent than shifting and masking. No functional change. Signed-off-by: NeilBrown <neilb@suse.de>	2014-10-09 10:07:04 +11:00
NeilBrown	cb8b12b5d8	md/raid10: always initialise ->state on newly allocated r10_bio Most places which allocate an r10_bio zero the ->state, some don't. As the r10_bio comes from a mempool, and the allocation function uses kzalloc it is often zero anyway. But sometimes it isn't and it is best to be safe. I only noticed this because of the bug fixed by an earlier patch where the r10_bios allocated for a reshape were left around to be used by a subsequent resync. In that case the R10BIO_IsReshape flag caused problems. Signed-off-by: NeilBrown <neilb@suse.de>	2014-08-19 17:20:27 +10:00
NeilBrown	e337aead3a	md/raid10: avoid memory leak on error path during reshape. If raid10 reshape fails to find somewhere to read a block from, it returns without freeing memory... Signed-off-by: NeilBrown <neilb@suse.de>	2014-08-19 17:20:27 +10:00
NeilBrown	b39685526f	md/raid10: Fix memory leak when raid10 reshape completes. When a raid10 commences a resync/recovery/reshape it allocates some buffer space. When a resync/recovery completes the buffer space is freed. But not when the reshape completes. This can result in a small memory leak. There is a subtle side-effect of this bug. When a RAID10 is reshaped to a larger array (more devices), the reshape is immediately followed by a "resync" of the new space. This "resync" will use the buffer space which was allocated for "reshape". This can cause problems including a "BUG" in the SCSI layer. So this is suitable for -stable. Cc: stable@vger.kernel.org (v3.5+) Fixes: `3ea7daa5d7` Signed-off-by: NeilBrown <neilb@suse.de>	2014-08-19 17:20:27 +10:00
NeilBrown	ce0b0a4695	md/raid10: fix memory leak when reshaping a RAID10. raid10 reshape clears unwanted bits from a bio->bi_flags using a method which, while clumsy, worked until 3.10 when BIO_OWNS_VEC was added. Since then it clears that bit but shouldn't. This results in a memory leak. So change to used the approved method of clearing unwanted bits. As this causes a memory leak which can consume all of memory the fix is suitable for -stable. Fixes: `a38352e0ac` Cc: stable@vger.kernel.org (v3.10+) Reported-by: mdraid.pkoch@dfgh.net (Peter Koch) Signed-off-by: NeilBrown <neilb@suse.de>	2014-08-19 17:20:27 +10:00
NeilBrown	2446dba03f	md/raid1,raid10: always abort recover on write error. Currently we don't abort recovery on a write error if the write error to the recovering device was triggerd by normal IO (as opposed to recovery IO). This means that for one bitmap region, the recovery might write to the recovering device for a few sectors, then not bother for subsequent sectors (as it never writes to failed devices). In this case the bitmap bit will be cleared, but it really shouldn't. The result is that if the recovering device fails and is then re-added (after fixing whatever hardware problem triggerred the failure), the second recovery won't redo the region it was in the middle of, so some of the device will not be recovered properly. If we abort the recovery, the region being processes will be cancelled (bit not cleared) and the whole region will be retried. As the bug can result in data corruption the patch is suitable for -stable. For kernels prior to 3.11 there is a conflict in raid10.c which will require care. Original-from: jiao hui <jiaohui@bwstor.com.cn> Reported-and-tested-by: jiao hui <jiaohui@bwstor.com.cn> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@vger.kernel.org	2014-07-31 10:16:52 +10:00
NeilBrown	cc13b1d150	md/raid10: call wait_barrier() for each request submitted. wait_barrier() includes a counter, so we must call it precisely once (unless balanced by allow_barrier()) for each request submitted. Since commit `20d0189b10` block: Introduce new bio_split() in 3.14-rc1, we don't call it for the extra requests generated when we need to split a bio. When this happens the counter goes negative, any resync/recovery will never start, and "mdadm --stop" will hang. Reported-by: Chris Murphy <lists@colorremedies.com> Fixes: `20d0189b10` Cc: stable@vger.kernel.org (3.14+) Cc: Kent Overstreet <kmo@daterainc.com> Signed-off-by: NeilBrown <neilb@suse.de>	2014-05-06 09:49:26 +10:00
Linus Torvalds	f568849eda	Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block Pull core block IO changes from Jens Axboe: "The major piece in here is the immutable bio_ve series from Kent, the rest is fairly minor. It was supposed to go in last round, but various issues pushed it to this release instead. The pull request contains: - Various smaller blk-mq fixes from different folks. Nothing major here, just minor fixes and cleanups. - Fix for a memory leak in the error path in the block ioctl code from Christian Engelmayer. - Header export fix from CaiZhiyong. - Finally the immutable biovec changes from Kent Overstreet. This enables some nice future work on making arbitrarily sized bios possible, and splitting more efficient. Related fixes to immutable bio_vecs: - dm-cache immutable fixup from Mike Snitzer. - btrfs immutable fixup from Muthu Kumar. - bio-integrity fix from Nic Bellinger, which is also going to stable" * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits) xtensa: fixup simdisk driver to work with immutable bio_vecs block/blk-mq-cpu.c: use hotcpu_notifier() blk-mq: for_each_* macro correctness block: Fix memory leak in rw_copy_check_uvector() handling bio-integrity: Fix bio_integrity_verify segment start bug block: remove unrelated header files and export symbol blk-mq: uses page->list incorrectly blk-mq: use __smp_call_function_single directly btrfs: fix missing increment of bi_remaining Revert "block: Warn and free bio if bi_end_io is not set" block: Warn and free bio if bi_end_io is not set blk-mq: fix initializing request's start time block: blk-mq: don't export blk_mq_free_queue() block: blk-mq: make blk_sync_queue support mq block: blk-mq: support draining mq queue dm cache: increment bi_remaining when bi_end_io is restored block: fixup for generic bio chaining block: Really silence spurious compiler warnings block: Silence spurious compiler warnings block: Kill bio_pair_split() ...	2014-01-30 11:19:05 -08:00
NeilBrown	0b59bb6422	md/raid10: avoid fullsync when not necessary. This is the raid10 equivalent of commit `4f0a5e012c` MD RAID1: Further conditionalize 'fullsync' If a device in a newly assembled array is not fully recovered we currently do a fully resync by don't need to. Signed-off-by: NeilBrown <neilb@suse.de>	2014-01-14 16:44:21 +11:00
NeilBrown	e8b8491585	md/raid10: fix bug when raid10 recovery fails to recover a block. commit `e875ecea26` md/raid10 record bad blocks as needed during recovery. added code to the "cannot recover this block" path to record a bad block rather than fail the whole recovery. Unfortunately this new case was placed after r10bio was freed rather than before, yet it still uses r10bio. This is will crash with a null dereference. So move the freeing of r10bio down where it is safe. Cc: stable@vger.kernel.org (v3.1+) Fixes: `e875ecea26` Reported-by: Damian Nowak <spam@nowaker.net> URL: https://bugzilla.kernel.org/show_bug.cgi?id=68181 Signed-off-by: NeilBrown <neilb@suse.de>	2014-01-14 16:44:08 +11:00
NeilBrown	b50c259e25	md/raid10: fix two bugs in handling of known-bad-blocks. If we discover a bad block when reading we split the request and potentially read some of it from a different device. The code path of this has two bugs in RAID10. 1/ we get a spin_lock with _irq, but unlock without _irq!! 2/ The calculation of 'sectors_handled' is wrong, as can be clearly seen by comparison with raid1.c This leads to at least 2 warnings and a probable crash is a RAID10 ever had known bad blocks. Cc: stable@vger.kernel.org (v3.1+) Fixes: `856e08e237` Reported-by: Damian Nowak <spam@nowaker.net> URL: https://bugzilla.kernel.org/show_bug.cgi?id=68181 Signed-off-by: NeilBrown <neilb@suse.de>	2014-01-14 16:44:07 +11:00
Kent Overstreet	20d0189b10	block: Introduce new bio_split() The new bio_split() can split arbitrary bios - it's not restricted to single page bios, like the old bio_split() (previously renamed to bio_pair_split()). It also has different semantics - it doesn't allocate a struct bio_pair, leaving it up to the caller to handle completions. Then convert the existing bio_pair_split() users to the new bio_split() - and also nvme, which was open coding bio splitting. (We have to take that BUG_ON() out of bio_integrity_trim() because this bio_split() needs to use it, and there's no reason it has to be used on bios marked as cloned; BIO_CLONED doesn't seem to have clearly documented semantics anyways.) Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Neil Brown <neilb@suse.de>	2013-11-23 22:33:57 -08:00
Kent Overstreet	ee67891bf1	block: Rename bio_split() -> bio_pair_split() This is prep work for introducing a more general bio_split(). Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: NeilBrown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Lars Ellenberg <lars.ellenberg@linbit.com> Cc: Peter Osterlund <petero2@telia.com> Cc: Sage Weil <sage@inktank.com>	2013-11-23 22:33:56 -08:00
Kent Overstreet	458b76ed2f	block: Kill bio_segments()/bi_vcnt usage When we start sharing biovecs, keeping bi_vcnt accurate for splits is going to be error prone - and unnecessary, if we refactor some code. So bio_segments() has to go - but most of the existing users just needed to know if the bio had multiple segments, which is easier - add a bio_multiple_segments() for them. (Two of the current uses of bio_segments() are going to go away in a couple patches, but the current implementation of bio_segments() is unsafe as soon as we start doing driver conversions for immutable biovecs - so implement a dumb version for bisectability, it'll go away in a couple patches) Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com> Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com>	2013-11-23 22:33:51 -08:00
Kent Overstreet	4f024f3797	block: Abstract out bvec iterator Immutable biovecs are going to require an explicit iterator. To implement immutable bvecs, a later patch is going to add a bi_bvec_done member to this struct; for now, this patch effectively just renames things. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Boaz Harrosh <bharrosh@panasas.com> Cc: Benny Halevy <bhalevy@tonian.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <chris.mason@fusionio.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Ben Myers <bpm@sgi.com> Cc: xfs@oss.sgi.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: "Roger Pau Monné" <roger.pau@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchand@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Peng Tao <tao.peng@emc.com> Cc: Andy Adamson <andros@netapp.com> Cc: fanchaoting <fanchaoting@cn.fujitsu.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Pankaj Kumar <pankaj.km@samsung.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Mel Gorman <mgorman@suse.de>6	2013-11-23 22:33:47 -08:00
Linus Torvalds	6d6e352c80	md update for 3.13. Mostly optimisations and obscure bug fixes. - raid5 gets less lock contention - raid1 gets less contention between normal-io and resync-io during resync. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iQIVAwUAUovzDznsnt1WYoG5AQJ1pQ//bDuXadoJ5dwjWjVxFOKoQ9j/9joEI0yH XTApD3ADKckdBc4TSLOIbCNLW1Pbe23HlOI/GjCiJ/7mePL3OwHd7Fx8Rfq3BubV f7NgjVwu8nwYD0OXEZsshImptEtrbYwQdy+qlKcHXcZz1MUfR+Egih3r/ouTEfEt FNq/6MpyN0IKSY82xP/jFZgesBucgKz/YOUIbwClxm7UiyISKvWQLBIAfLB3dyI3 HoEdEzQX6I56Rw0mkSUG4Mk+8xx/8twxL+yqEUqfdJREWuB56Km8kl8y/e465Nk0 ZZg6j/TrslVEwbEeVMx0syvYcaAWFZ4X2jdKfo1lI0g9beZp7H1GRF8yR1s2t/h4 g/vb55MEN++4LPaE9ut4z7SG2yLyGkZgFTzTjyq5of+DFL0cayO7wXxbgpcD7JYf Doef/OSa6csKiGiJI48iQa08Bolmz9ZWzZQXhAthKfFQ9Rv+GEtIAi4kLR8EZPbu 0/FL1ylYNUY9O7p0g+iy9Kcoc+xW36I95pPZf8pO8GFcXTjyuCCBVh/SNvFZZHPl 3xk3aZJknAEID8VrVG2IJPkeDI8WK8YxmpU/nARCoytn07Df6Ye8jGvLdR8pL3lB TIZV6eRY4yciB8LtoK9Kg4XTmOMhBtjt4c3znkljp98vhOQQb/oHN+BXMGcwqvr9 fk0KGrg31VA= =8RCg -----END PGP SIGNATURE----- Merge tag 'md/3.13' of git://neil.brown.name/md Pull md update from Neil Brown: "Mostly optimisations and obscure bug fixes. - raid5 gets less lock contention - raid1 gets less contention between normal-io and resync-io during resync" * tag 'md/3.13' of git://neil.brown.name/md: md/raid5: Use conf->device_lock protect changing of multi-thread resources. md/raid5: Before freeing old multi-thread worker, it should flush them. md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE. UAPI: include <asm/byteorder.h> in linux/raid/md_p.h raid1: Rewrite the implementation of iobarrier. raid1: Add some macros to make code clearly. raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array. raid1: Add a field array_frozen to indicate whether raid in freeze state. md: Convert use of typedef ctl_table to struct ctl_table md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes. md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread. md: fix some places where mddev_lock return value is not checked. raid5: Retry R5_ReadNoMerge flag when hit a read error. raid5: relieve lock contention in get_active_stripe() raid5: relieve lock contention in get_active_stripe() wait: add wait_event_cmd() md/raid5.c: add proper locking to error path of raid5_start_reshape. md: fix calculation of stacking limits on level change. raid5: Use slow_path to release stripe when mddev->thread is null	2013-11-20 13:05:25 -08:00
NeilBrown	c91abf5a35	md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread. We currently use kthread_should_stop() in various places in the sync/reshape code to abort early. However some places set MD_RECOVERY_INTR but don't immediately call md_reap_sync_thread() (and we will shortly get another one). When this happens we are relying on md_check_recovery() to reap the thread and that only happen when it finishes normally. So MD_RECOVERY_INTR must lead to a normal finish without the kthread_should_stop() test. So replace all relevant tests, and be more careful when the thread is interrupted not to acknowledge that latest step in a reshape as it may not be fully committed yet. Also add a test on MD_RECOVERY_INTR in the 'is_mddev_idle' loop so we don't wait have to wait for the speed to drop before we can abort. Signed-off-by: NeilBrown <neilb@suse.de>	2013-11-19 15:19:17 +11:00
Kent Overstreet	6678d83f18	block: Consolidate duplicated bio_trim() implementations Someone cut and pasted md's md_trim_bio() into xen-blkfront.c. Come on, we should know better than this. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-11-08 09:02:31 -07:00
Lukasz Dorau	61e4947c99	md: Fix skipping recovery for read-only arrays. Since: commit `7ceb17e87b` md: Allow devices to be re-added to a read-only array. spares are activated on a read-only array. In case of raid1 and raid10 personalities it causes that not-in-sync devices are marked in-sync without checking if recovery has been finished. If a read-only array is degraded and one of its devices is not in-sync (because the array has been only partially recovered) recovery will be skipped. This patch adds checking if recovery has been finished before marking a device in-sync for raid1 and raid10 personalities. In case of raid5 personality such condition is already present (at raid5.c:6029). Bug was introduced in 3.10 and causes data corruption. Cc: stable@vger.kernel.org Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com> Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-10-24 12:55:17 +11:00
NeilBrown	0eb25bb027	md/raid10: remove use-after-free bug. We always need to be careful when calling generic_make_request, as it can start a chain of events which might free something that we are using. Here is one place I wasn't careful enough. If the wbio2 is not in use, then it might get freed at the first generic_make_request call. So perform all necessary tests first. This bug was introduced in 3.3-rc3 (`24afd80d99`) and can cause an oops, so fix is suitable for any -stable since then. Cc: stable@vger.kernel.org (3.3+) Signed-off-by: NeilBrown <neilb@suse.de>	2013-07-25 16:46:53 +10:00
NeilBrown	7bb23c4934	md/raid10: fix two problems with RAID10 resync. 1/ When an different between blocks is found, data is copied from one bio to the other. However bv_len is used as the length to copy and this could be zero. So use r10_bio->sectors to calculate length instead. Using bv_len was probably always a bit dubious, but the introduction of bio_advance made it much more likely to be a problem. 2/ When preparing some blocks for sync, we don't set BIO_UPTODATE except on bios that we schedule for a read. This ensures that missing/failed devices don't confuse the loop at the top of sync_request write. Commit `8be185f2c9` "raid10: Use bio_reset()" removed a loop which set BIO_UPTDATE on all appropriate bios. So we need to re-add that flag. These bugs were introduced in 3.10, so this patch is suitable for 3.10-stable, and can remove a potential for data corruption. Cc: stable@vger.kernel.org (3.10) Reported-by: Brassow Jonathan <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-07-18 14:18:01 +10:00
NeilBrown	1376512065	md/raid10: fix bug which causes all RAID10 reshapes to move no data. The recent comment: commit `7e83ccbecd` md/raid10: Allow skipping recovery when clean arrays are assembled Causes raid10 to skip a recovery in certain cases where it is safe to do so. Unfortunately it also causes a reshape to be skipped which is never safe. The result is that an attempt to reshape a RAID10 will appear to complete instantly, but no data will have been moves so the array will now contain garbage. (If nothing is written, you can recovery by simple performing the reverse reshape which will also complete instantly). Bug was introduced in 3.10, so this is suitable for 3.10-stable. Cc: stable@vger.kernel.org (3.10) Cc: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>	2013-07-04 16:42:57 +10:00
NeilBrown	78eaa0d4cb	md/raid10: fix two bugs affecting RAID10 reshape. 1/ If a RAID10 is being reshaped to a fewer number of devices and is stopped while this is ongoing, then when the array is reassembled the 'mirrors' array will be allocated too small. This will lead to an access error or memory corruption. 2/ A sanity test for a reshaping RAID10 array is restarted is slightly incorrect. Due to the first bug, this is suitable for any -stable kernel since 3.5 where this code was introduced. Cc: stable@vger.kernel.org (v3.5+) Signed-off-by: NeilBrown <neilb@suse.de>	2013-07-03 09:43:28 +10:00
NeilBrown	725d6e579f	md/raid10: check In_sync flag in 'enough()'. It isn't really enough to check that the rdev is present, we need to also be sure that the device is still In_sync. Doing this requires using rcu_dereference to access the rdev, and holding the rcu_read_lock() to ensure the rdev doesn't disappear while we look at it. Signed-off-by: NeilBrown <neilb@suse.de>	2013-06-14 08:10:27 +10:00
NeilBrown	635f6416a2	md/raid10: locking changes for 'enough()'. As 'enough' accesses conf->prev and conf->geo, which can change spontanously, it should guard against changes. This can be done with device_lock as start_reshape holds device_lock while updating 'geo' and end_reshape holds it while updating 'prev'. So 'error' needs to hold 'device_lock'. On the other hand, raid10_end_read_request knows which of the two it really wants to access, and as it is an active request on that one, the value cannot change underneath it. So change _enough to take flag rather than a pointer, pass the appropriate flag from raid10_end_read_request(), and remove the locking. All other calls to 'enough' are made with reconfig_mutex held, so neither 'prev' nor 'geo' can change. Signed-off-by: NeilBrown <neilb@suse.de>	2013-06-14 08:10:27 +10:00
Jonathan Brassow	9092c02d94	DM RAID: Add ability to restore transiently failed devices on resume DM RAID: Add ability to restore transiently failed devices on resume This patch adds code to the resume function to check over the devices in the RAID array. If any are found to be marked as failed and their superblocks can be read, an attempt is made to reintegrate them into the array. This allows the user to refresh the array with a simple suspend and resume of the array - rather than having to load a completely new table, allocate and initialize all the structures and throw away the old instantiation. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-06-14 08:10:24 +10:00
Linus Torvalds	82ea4be61f	A few bugfixes for md Some tagged for -stable. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iQIVAwUAUbl1mznsnt1WYoG5AQKGlQ//eixdawF+DUK5hadqZ9EDni+BAVzb7m69 +zU6ilQ7UOh7bxtAoJqrgFVykK+LG8wvYsEBwMjB9oRDLA96/YDXXiBzXHvd6mGh g271lwMTQ9h+O8L6psLUX6qsrH3i7SJmF8ySPKi6Fe5ruT8ToOB8Ii8XQebEZdXo VOzRz2VgSTcBdrTyKPDsBJByDQX36hsK8Gs5YSl5F3nvyV4dvGWMlyoTF1TRRt9K YCCZ8pSk3kTXaSdl0syrJxI17pEUC8mtcA01S6JD/GV49CGO8LYAckVJ4ijWw7VV IGGlH0DsYSMgJ7yyuLz4ifaqRnsWsAGW0WyiZYYKvjtNUiyBuBBbo2cQ1lNkR5p4 jnLhpJJVh0hLCPn6wcCWIBIdT/mFaBpXkvZPd3ks5kefGXsfpVPm0fK8r0fzkzgy tJCZtZFZHeK1qsgaDsiS76S2ZNcFh0HQVIa84Q200/XUDgh8dYlD0+7oIsVu0UBZ 72Aop+Ak9+k4vKTvB9/hpcY+Rt0MI7zKewXBDSDK1sXhIHLQqv8rCEeNYiuPPqr/ ghRukn+C/Wtr7JYBsX+jMjxtmSzYtwBOihwLoZCH9pp3C5jTvyQk9s8n1j13V2RK sAFtfpCVoQ8tTa7IITKRMfftzHn1WiPlPsj6VbigJ6A4N98csgv7x2rF7FyqcF0X aoj69nQ3i/4= =8iy3 -----END PGP SIGNATURE----- Merge tag 'md-3.10-fixes' of git://neil.brown.name/md Pull md bugfixes from Neil Brown: "A few bugfixes for md Some tagged for -stable" * tag 'md-3.10-fixes' of git://neil.brown.name/md: md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place md/raid1,raid10: use freeze_array in place of raise_barrier in various places. md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it. md: md_stop_writes() should always freeze recovery.	2013-06-13 10:13:29 -07:00
H. Peter Anvin	5026d7a9b2	md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place There are cases where the kernel will believe that the WRITE SAME command is supported by a block device which does not, in fact, support WRITE SAME. This currently happens for SATA drivers behind a SAS controller, but there are probably a hundred other ways that can happen, including drive firmware bugs. After receiving an error for WRITE SAME the block layer will retry the request as a plain write of zeroes, but mdraid will consider the failure as fatal and consider the drive failed. This has the effect that all the mirrors containing a specific set of data are each offlined in very rapid succession resulting in data loss. However, just bouncing the request back up to the block layer isn't ideal either, because the whole initial request-retry sequence should be inside the write bitmap fence, which probably means that md needs to do its own conversion of WRITE SAME to write zero. Until the failure scenario has been sorted out, disable WRITE SAME for raid1, raid5, and raid10. [neilb: added raid5] This patch is appropriate for any -stable since 3.7 when write_same support was added. Cc: stable@vger.kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-06-13 14:49:54 +10:00
NeilBrown	e2d5992522	md/raid1,raid10: use freeze_array in place of raise_barrier in various places. Various places in raid1 and raid10 are calling raise_barrier when they really should call freeze_array. The former is only intended to be called from "make_request". The later has extra checks for 'nr_queued' and makes a call to flush_pending_writes(), so it is safe to call it from within the management thread. Using raise_barrier will sometimes deadlock. Using freeze_array should not. As 'freeze_array' currently expects one request to be pending (in handle_read_error - the only previous caller), we need to pass it the number of pending requests (extra) to ignore. The deadlock was made particularly noticeable by commits `050b66152f` (raid10) and `6b740b8d79` (raid1) which appeared in 3.4, so the fix is appropriate for any -stable kernel since then. This patch probably won't apply directly to some early kernels and will need to be applied by hand. Cc: stable@vger.kernel.org Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-06-13 13:40:48 +10:00
Alex Lyakas	3056e3aec8	md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it. Without that fix, the following scenario could happen: - RAID1 with drives A and B; drive B was freshly-added and is rebuilding - Drive A fails - WRITE request arrives to the array. It is failed by drive A, so r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B succeeds in writing it, so the same r1_bio is marked as R1BIO_Uptodate. - r1_bio arrives to handle_write_finished, badblocks are disabled, md_error()->error() does nothing because we don't fail the last drive of raid1 - raid_end_bio_io() calls call_bio_endio() - As a result, in call_bio_endio(): if (!test_bit(R1BIO_Uptodate, &r1_bio->state)) clear_bit(BIO_UPTODATE, &bio->bi_flags); this code doesn't clear the BIO_UPTODATE flag, and the whole master WRITE succeeds, back to the upper layer. So we returned success to the upper layer, even though we had written the data onto the rebuilding drive only. But when we want to read the data back, we would not read from the rebuilding drive, so this data is lost. [neilb - applied identical change to raid10 as well] This bug can result in lost data, so it is suitable for any -stable kernel. Cc: stable@vger.kernel.org Signed-off-by: Alex Lyakas <alex@zadarastorage.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-06-13 13:20:03 +10:00
Linus Torvalds	4de13d7aa8	Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block Pull block core updates from Jens Axboe: - Major bit is Kents prep work for immutable bio vecs. - Stable candidate fix for a scheduling-while-atomic in the queue bypass operation. - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging discard bios. - Tejuns changes to convert the writeback thread pool to the generic workqueue mechanism. - Runtime PM framework, SCSI patches exists on top of these in James' tree. - A few random fixes. * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits) relay: move remove_buf_file inside relay_close_buf partitions/efi.c: replace useless kzalloc's by kmalloc's fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read() block: fix max discard sectors limit blkcg: fix "scheduling while atomic" in blk_queue_bypass_start Documentation: cfq-iosched: update documentation help for cfq tunables writeback: expose the bdi_wq workqueue writeback: replace custom worker pool implementation with unbound workqueue writeback: remove unused bdi_pending_list aoe: Fix unitialized var usage bio-integrity: Add explicit field for owner of bip_buf block: Add an explicit bio flag for bios that own their bvec block: Add bio_alloc_pages() block: Convert some code to bio_for_each_segment_all() block: Add bio_for_each_segment_all() bounce: Refactor __blk_queue_bounce to not use bi_io_vec raid1: use bio_copy_data() pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage pktcdvd: use bio_copy_data() block: Add bio_copy_data() ...	2013-05-08 10:13:35 -07:00
Shaohua Li	32f9f570d0	MD: ignore discard request for hard disks of hybid raid1/raid10 array In SSD/hard disk hybid storage, discard request should be ignored for hard disk. We used to be doing this way, but the unplug path forgets it. This is suitable for stable tree since v3.6. Cc: stable@vger.kernel.org Reported-and-tested-by: Markus <M4rkusXXL@web.de> Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-04-30 14:49:36 +10:00
Hirokazu Takahashi	0fea7ed82b	md: raid1/raid10 md devices leak memory when stopping Hi. Raid1 and raid10 devices leak memory every time they stop. This is a patch for linux-3.9.0-rc7 to fix this problem. Thanks, Hirokazu Takahashi. Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp> Signed-off-by: NeilBrown <neilb@suse.de>	2013-04-30 14:49:26 +10:00
Martin Wilck	7e83ccbecd	md/raid10: Allow skipping recovery when clean arrays are assembled When an array is assembled incrementally with mdadm -I -R and the array switches to "active" mode, md starts a recovery. If the array was clean, the "fullsync" flag will be 0. Skip the full recovery in this case, as RAID1 does (the code was actually copied from the sync_request() method of RAID1). Signed-off-by: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>	2013-04-24 11:42:42 +10:00
Kent Overstreet	8be185f2c9	raid10: Use bio_reset() More prep work for immutable bio vecs, mainly getting rid of references to bi_idx. bio_reset was being open coded in a few places. The one in sync_request was a bit nontrivial to convert, so could use some extra eyeballs. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: NeilBrown <neilb@suse.de> Acked-by: NeilBrown <neilb@suse.de>	2013-03-23 14:15:33 -07:00
Kent Overstreet	9e882242c6	block: Add submit_bio_wait(), remove from md Random cleanup - this code was duplicated and it's not really specific to md. Also added the ability to return the actual error code. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: NeilBrown <neilb@suse.de> Acked-by: Tejun Heo <tj@kernel.org>	2013-03-23 14:15:32 -07:00
Kent Overstreet	4f2ac93c17	block: Remove bi_idx references For immutable bvecs, all bi_idx usage needs to be audited - so here we're removing all the unnecessary uses. Most of these are places where it was being initialized on a bio that was just allocated, a few others are conversions to standard macros. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk>	2013-03-23 14:15:31 -07:00
Kent Overstreet	5b83636ae3	block: Change bio_split() to respect the current value of bi_idx In the current code bio_split() won't be seeing partially completed bios so this doesn't change any behaviour, but this makes the code a bit clearer as to what bio_split() actually requires. The immediate purpose of the patch is removing unnecessary bi_idx references, but the end goal is to allow partial completed bios to be submitted, which along with immutable biovecs enables effecient bio splitting. Some of the callers were (double) checking that bios could be split, so update their checks too. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: Lars Ellenberg <drbd-dev@lists.linbit.com> CC: Neil Brown <neilb@suse.de> CC: Martin K. Petersen <martin.petersen@oracle.com>	2013-03-23 14:15:30 -07:00
Kent Overstreet	aa8b57aa3d	block: Use bio_sectors() more consistently Bunch of places in the code weren't using it where they could be - this'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx into a struct bvec_iter. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: "Ed L. Cashin" <ecashin@coraid.com> CC: Nick Piggin <npiggin@kernel.dk> CC: Jiri Kosina <jkosina@suse.cz> CC: Jim Paris <jim@jtan.com> CC: Geoff Levand <geoff@infradead.org> CC: Alasdair Kergon <agk@redhat.com> CC: dm-devel@redhat.com CC: Neil Brown <neilb@suse.de> CC: Steven Rostedt <rostedt@goodmis.org> Acked-by: Ed Cashin <ecashin@coraid.com>	2013-03-23 14:15:30 -07:00
NeilBrown	ee0b024403	md/raid1,raid10: fix deadlock with freeze_array() When raid1/raid10 needs to fix a read error, it first drains all pending requests by calling freeze_array(). This calls flush_pending_writes() if it needs to sleep, but some writes may be pending in a per-process plug rather than in the per-array request queue. When raid1{,0}_unplug() moves the request from the per-process plug to the per-array request queue (from which flush_pending_writes() can flush them), it needs to wake up freeze_array(), or freeze_array() will never flush them and so it will block forever. So add the requires wake_up() calls. This bug was introduced by commit `f54a9d0e59` for raid1 and a similar commit for RAID10, and so has been present since linux-3.6. As the bug causes a deadlock I believe this fix is suitable for -stable. Cc: stable@vger.kernel.org (3.6.y 3.7.y 3.8.y) Reported-by: Tregaron Bayly <tbayly@bluehost.com> Tested-by: Tregaron Bayly <tbayly@bluehost.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-02-26 11:58:50 +11:00
Jonathan Brassow	9a3152ab02	MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 2) MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 2) This patch addresses raid arrays that have a number of devices that cannot be evenly divided by 'far_copies'. (E.g. 5 devices, far_copies = 2) This case must be handled differently because it causes that last set to be of a different size than the rest of the sets. We must compute a new modulo for this last set so that copied chunks are properly wrapped around. Example use_far_sets=1, far_copies=2, near_copies=1, devices=5: "far" algorithm dev1 dev2 dev3 dev4 dev5 ==== ==== ==== ==== ==== [ A B ] [ C D E ] [ G H ] [ I J K ] ... [ B A ] [ E C D ] --> nominal set of 2 and last set of 3 [ H G ] [ K I J ] []'s show far/offset sets Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-02-26 11:55:33 +11:00
Jonathan Brassow	475901aff1	MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 1) The MD RAID10 'far' and 'offset' algorithms make copies of entire stripe widths - copying them to a different location on the same devices after shifting the stripe. An example layout of each follows below: "far" algorithm dev1 dev2 dev3 dev4 dev5 dev6 ==== ==== ==== ==== ==== ==== A B C D E F G H I J K L ... F A B C D E --> Copy of stripe0, but shifted by 1 L G H I J K ... "offset" algorithm dev1 dev2 dev3 dev4 dev5 dev6 ==== ==== ==== ==== ==== ==== A B C D E F F A B C D E --> Copy of stripe0, but shifted by 1 G H I J K L L G H I J K ... Redundancy for these algorithms is gained by shifting the copied stripes one device to the right. This patch proposes that array be divided into sets of adjacent devices and when the stripe copies are shifted, they wrap on set boundaries rather than the array size boundary. That is, for the purposes of shifting, the copies are confined to their sets within the array. The sets are 'near_copies * far_copies' in size. The above "far" algorithm example would change to: "far" algorithm dev1 dev2 dev3 dev4 dev5 dev6 ==== ==== ==== ==== ==== ==== A B C D E F G H I J K L ... B A D C F E --> Copy of stripe0, shifted 1, 2-dev sets H G J I L K Dev sets are 1-2, 3-4, 5-6 ... This has the affect of improving the redundancy of the array. We can always sustain at least one failure, but sometimes more than one can be handled. In the first examples, the pairs of devices that CANNOT fail together are: (1,2) (2,3) (3,4) (4,5) (5,6) (1, 6) [40% of possible pairs] In the example where the copies are confined to sets, the pairs of devices that cannot fail together are: (1,2) (3,4) (5,6) [20% of possible pairs] We cannot simply replace the old algorithms, so the 17th bit of the 'layout' variable is used to indicate whether we use the old or new method of computing the shift. (This is similar to the way the 16th bit indicates whether the "far" algorithm or the "offset" algorithm is being used.) This patch only handles the cases where the number of total raid disks is a multiple of 'far_copies'. A follow-on patch addresses the condition where this is not true. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-02-26 11:55:30 +11:00
Jonathan Brassow	4c0ca26bd2	MD RAID10: Minor non-functional code changes Changes include assigning 'addr' from 's' instead of 'sector' to be consistent with the way the code does it just a few lines later and using '%=' vs a conditional and subtraction. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-02-26 11:55:27 +11:00
Joe Lawrence	c8dc9c6547	md: raid1,10: Handle REQ_WRITE_SAME flag in write bios Set mddev queue's max_write_same_sectors to its chunk_sector value (before disk_stack_limits merges the underlying disk limits.) With that in place, be sure to handle writes coming down from the block layer that have the REQ_WRITE_SAME flag set. That flag needs to be copied into any newly cloned write bio. Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com> Acked-by: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: NeilBrown <neilb@suse.de>	2013-02-26 11:55:21 +11:00
Linus Torvalds	9228ff9038	Merge branch 'for-3.8/drivers' of git://git.kernel.dk/linux-block Pull block driver update from Jens Axboe: "Now that the core bits are in, here are the driver bits for 3.8. The branch contains: - A huge pile of drbd bits that were dumped from the 3.7 merge window. Following that, it was both made perfectly clear that there is going to be no more over-the-wall pulls and how the situation on individual pulls can be improved. - A few cleanups from Akinobu Mita for drbd and cciss. - Queue improvement for loop from Lukas. This grew into adding a generic interface for waiting/checking an even with a specific lock, allowing this to be pulled out of md and now loop and drbd is also using it. - A few fixes for xen back/front block driver from Roger Pau Monne. - Partition improvements from Stephen Warren, allowing partiion UUID to be used as an identifier." * 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits) drbd: update Kconfig to match current dependencies drbd: Fix drbdsetup wait-connect, wait-sync etc... commands drbd: close race between drbd_set_role and drbd_connect drbd: respect no-md-barriers setting also when changed online via disk-options drbd: Remove obsolete check drbd: fixup after wait_even_lock_irq() addition to generic code loop: Limit the number of requests in the bio list wait: add wait_event_lock_irq() interface xen-blkfront: free allocated page xen-blkback: move free persistent grants code block: partition: msdos: provide UUIDs for partitions init: reduce PARTUUID min length to 1 from 36 block: store partition_meta_info.uuid as a string cciss: use check_signature() cciss: cleanup bitops usage drbd: use copy_highpage drbd: if the replication link breaks during handshake, keep retrying drbd: check return of kmalloc in receive_uuids drbd: Broadcast sync progress no more often than once per second drbd: don't try to clear bits once the disk has failed ...	2012-12-17 13:39:11 -08:00
Lukas Czerner	eed8c02e68	wait: add wait_event_lock_irq() interface New wait_event{_interruptible}_lock_irq{_cmd} macros added. This commit moves the private wait_event_lock_irq() macro from MD to regular wait includes, introduces new macro wait_event_lock_irq_cmd() instead of using the old method with omitting cmd parameter which is ugly and makes a use of new macros in the MD. It also introduces the _interruptible_ variant. The use of new interface is when one have a special lock to protect data structures used in the condition, or one also needs to invoke "cmd" before putting it to sleep. All new macros are expected to be called with the lock taken. The lock is released before sleep and is reacquired afterwards. We will leave the macro with the lock held. Note to DM: IMO this should also fix theoretical race on waitqueue while using simultaneously wait_event_lock_irq() and wait_event() because of lack of locking around current state setting and wait queue removal. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Neil Brown <neilb@suse.de> Cc: David Howells <dhowells@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-11-30 11:47:57 +01:00
NeilBrown	874807a831	md/raid1{,0}: fix deadlock in bitmap_unplug. If the raid1 or raid10 unplug function gets called from a make_request function (which is very possible) when there are bios on the current->bio_list list, then it will not be able to successfully call bitmap_unplug() and it could need to submit more bios and wait for them to complete. But they won't complete while current->bio_list is non-empty. So detect that case and handle the unplugging off to another thread just like we already do when called from within the scheduler. RAID1 version of bug was introduced in 3.6, so that part of fix is suitable for 3.6.y. RAID10 part won't apply. Cc: stable@vger.kernel.org Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com> Reported-by: Peter Maloney <peter.maloney@brockmann-consult.de> Signed-off-by: NeilBrown <neilb@suse.de>	2012-11-27 12:14:40 +11:00
NeilBrown	884162df2a	md/raid10: decrement correct pending counter when writing to replacement. When a write to a replacement device completes, we carefully and correctly found the rdev that the write actually went to and the blithely called rdev_dec_pending on the primary rdev, even if this write was to the replacement. This means that any writes to an array while a replacement was ongoing would cause the nr_pending count for the primary device to go negative, so it could never be removed. This bug has been present since replacement was introduced in 3.3, so it is suitable for any -stable kernel since then. Reported-by: "George Spelvin" <linux@horizon.com> Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2012-11-22 15:12:42 +11:00
NeilBrown	e7c0c3fa29	md/raid10: close race that lose writes lost when replacement completes. When a replacement operation completes there is a small window when the original device is marked 'faulty' and the replacement still looks like a replacement. The faulty should be removed and the replacement moved in place very quickly, bit it isn't instant. So the code write out to the array must handle the possibility that the only working device for some slot in the replacement - but it doesn't. If the primary device is faulty it just gives up. This can lead to corruption. So make the code more robust: if either the primary or the replacement is present and working, write to them. Only when neither are present do we give up. This bug has been present since replacement was introduced in 3.3, so it is suitable for any -stable kernel since then. Reported-by: "George Spelvin" <linux@horizon.com> Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2012-11-22 15:12:36 +11:00
Jonathan Brassow	ed30be077e	MD RAID10: Fix oops when creating RAID10 arrays via dm-raid.c Commit `2863b9eb` didn't take into account the changes to add TRIM support to RAID10 (commit `532a2a3fb`). That is, when using dm-raid.c to create the RAID10 arrays, there is no mddev->gendisk or mddev->queue. The code added to support TRIM simply assumes that mddev->queue is available without checking. The result is an oops any time dm-raid.c attempts to create a RAID10 device. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-10-31 11:42:30 +11:00
Linus Torvalds	9db908806b	md updates for 3.7 "discard" support, some dm-raid improvements and other assorted bits and pieces. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.18 (GNU/Linux) iQIVAwUAUHk6Rjnsnt1WYoG5AQKovQ//Ym0ROo5a6uekb2USLyFSdQH3TC7z0v0+ +kujrgoc4nHZU/vj5yfMvPVomEUsAhHEwTkvvCiXFFHn6cxPzC8ezm8d40xEeISX qp6i2bPlvGURhsW1tYeD+THtY82/oyzQ4Wa/vaE1sjVLQ+caa2q7kVVgAL9Bj/Kz aESIZjAuPxQNE1674/KR0EmMFcbpd0z1WDV+ydKlRV5jHCHGYf8OmxOenJFf+V/b /f9p2u+NUq5BN5WLhThcysO8lPX1Y7GG8IYay3DlSt/crU24R2a2j0qh/BDoK8+t /DceoHipbIiGxXLVjM7y+1RwPpCh75HJSZQHltPype2Z3iwtwEth9uTkEE3M2h/W tOQEbOZku0kcgsrys7JBmpkBwkR9oZqq1kDd4YBzqW4PiGVP6z0JRH8QpjjB+mjN 47ODYIZcaEYZ+0Jj8kcVxo3gv4Xj4DWH+auSNZihTVmjQPVqrcy3CAt3CkuDzTkY 34fZVuCDiCetLGCGQKrwfMDnySVy5xOmtC6iWsEY5rExAeb0E+BCzcBvbAXzt+ef MPDsrxWbo/ZkvpuwXOwLFTccBuRtAsFi7CM4jcow53W6XMnPpdubphNw5nylaEm1 DEzfID58mv8VHWRuW15vr7SbtROjYJkEFCIaEK3oprrRUYftZntIABcknqvcIYR+ /ULNzkRU1w4= =XRmL -----END PGP SIGNATURE----- Merge tag 'md-3.7' of git://neil.brown.name/md Pull md updates from NeilBrown: - "discard" support, some dm-raid improvements and other assorted bits and pieces. * tag 'md-3.7' of git://neil.brown.name/md: (29 commits) md: refine reporting of resync/reshape delays. md/raid5: be careful not to resize_stripes too big. md: make sure manual changes to recovery checkpoint are saved. md/raid10: use correct limit variable md: writing to sync_action should clear the read-auto state. Subject: [PATCH] md:change resync_mismatches to atomic64_t to avoid races md/raid5: make sure to_read and to_write never go negative. md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write. md/raid5: protect debug message against NULL derefernce. md/raid5: add some missing locking in handle_failed_stripe. MD: raid5 avoid unnecessary zero page for trim MD: raid5 trim support md/bitmap:Don't use IS_ERR to judge alloc_page(). md/raid1: Don't release reference to device while handling read error. raid: replace list_for_each_continue_rcu with new interface add further __init annotations to crypto/xor.c DM RAID: Fix for "sync" directive ineffectiveness DM RAID: Fix comparison of index and quantity for "rebuild" parameter DM RAID: Add rebuild capability for RAID10 DM RAID: Move 'rebuild' checking code to its own function ...	2012-10-13 13:22:01 -07:00
Dan Carpenter	91502f099d	md/raid10: use correct limit variable Clang complains that we are assigning a variable to itself. This should be using bad_sectors like the similar earlier check does. Bug has been present since 3.1-rc1. It is minor but could conceivably cause corruption or other bad behaviour. Cc: stable@vger.kernel.org Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-10-11 14:20:58 +11:00
Jianpeng Ma	7f7583d420	Subject: [PATCH] md:change resync_mismatches to atomic64_t to avoid races Now that multiple threads can handle stripes, it is safer to use an atomic64_t for resync_mismatches, to avoid update races. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-10-11 14:17:59 +11:00
Jonathan Brassow	2863b9eb44	MD RAID10: Prep for DM RAID10 device replacement capability MD RAID10: Fix a couple potential kernel panics if RAID10 is used by dm-raid When device-mapper uses the RAID10 personality through dm-raid.c, there is no 'gendisk' structure in mddev and some sysfs information is also not populated. This patch avoids touching those non-existent structures. Signed-off-by: Jonathan Brassow <jbrassow@rehdat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-10-11 13:38:58 +11:00
Shaohua Li	4ed8731d8e	MD: change the parameter of md thread Change the thread parameter, so the thread can carry extra info. Next patch will use it. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-10-11 13:34:00 +11:00
NeilBrown	57c67df488	md/raid10: submit IO from originating thread instead of md thread. queuing writes to the md thread means that all requests go through the one processor which may not be able to keep up with very high request rates. So use the plugging infrastructure to submit all requests on unplug. If a 'schedule' is needed, we fall back on the old approach of handing the requests to the thread for it to handle. This is nearly identical to a recent patch which provided similar functionality to RAID1. Signed-off-by: NeilBrown <neilb@suse.de>	2012-10-11 13:32:13 +11:00
Shaohua Li	532a2a3fba	md: raid 10 supports TRIM This makes md raid 10 support TRIM. If one disk supports discard and another not, or one has discard_zero_data and another not, there could be inconsistent between data from such disks. But this should not matter, discarded data is useless. This will add extra copy in rebuild though. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-10-11 13:30:52 +11:00
NeilBrown	80b4812407	md/raid10: fix "enough" function for detecting if array is failed. The 'enough' function is written to work with 'near' arrays only in that is implicitly assumes that the offset from one 'group' of devices to the next is the same as the number of copies. In reality it is the number of 'near' copies. So change it to make this number explicit. This bug makes it possible to run arrays without enough drives present, which is dangerous. It is appropriate for an -stable kernel, but will almost certainly need to be modified for some of them. Cc: stable@vger.kernel.org Reported-by: Jakub Husák <jakub@gooseman.cz> Signed-off-by: NeilBrown <neilb@suse.de>	2012-09-27 12:35:21 +10:00
NeilBrown	e0ee778528	md/raid10: fix problem with on-stack allocation of r10bio structure. A 'struct r10bio' has an array of per-copy information at the end. This array is declared with size [0] and r10bio_pool_alloc allocates enough extra space to store the per-copy information depending on the number of copies needed. So declaring a 'struct r10bio on the stack isn't going to work. It won't allocate enough space, and memory corruption will ensue. So in the two places where this is done, declare a sufficiently large structure and use that instead. The two call-sites of this bug were introduced in 3.4 and 3.5 so this is suitable for both those kernels. The patch will have to be modified for 3.4 as it only has one bug. Cc: stable@vger.kernel.org Reported-by: Ivan Vasilyev <ivan.vasilyev@gmail.com> Tested-by: Ivan Vasilyev <ivan.vasilyev@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-08-18 09:51:42 +10:00
Linus Torvalds	eff0d13f38	Merge branch 'for-3.6/drivers' of git://git.kernel.dk/linux-block Pull block driver changes from Jens Axboe: - Making the plugging support for drivers a bit more sane from Neil. This supersedes the plugging change from Shaohua as well. - The usual round of drbd updates. - Using a tail add instead of a head add in the request completion for ndb, making us find the most completed request more quickly. - A few floppy changes, getting rid of a duplicated flag and also running the floppy init async (since it takes forever in boot terms) from Andi. * 'for-3.6/drivers' of git://git.kernel.dk/linux-block: floppy: remove duplicated flag FD_RAW_NEED_DISK blk: pass from_schedule to non-request unplug functions. block: stack unplug blk: centralize non-request unplug handling. md: remove plug_cnt feature of plugging. block/nbd: micro-optimization in nbd request completion drbd: announce FLUSH/FUA capability to upper layers drbd: fix max_bio_size to be unsigned drbd: flush drbd work queue before invalidate/invalidate remote drbd: fix potential access after free drbd: call local-io-error handler early drbd: do not reset rs_pending_cnt too early drbd: reset congestion information before reporting it in /proc/drbd drbd: report congestion if we are waiting for some userland callback drbd: differentiate between normal and forced detach drbd: cleanup, remove two unused global flags floppy: Run floppy initialization asynchronous	2012-08-01 09:06:47 -07:00
NeilBrown	0021b7bc04	md: remove plug_cnt feature of plugging. This seemed like a good idea at the time, but after further thought I cannot see it making a difference other than very occasionally and testing to try to exercise the case it is most likely to help did not show any performance difference by removing it. So remove the counting of active plugs and allow 'pending writes' to be activated at any time, not just when no plugs are active. This is only relevant when there is a write-intent bitmap, and the updating of the bitmap will likely introduce enough delay that the single-threading of bitmap updates will be enough to collect large numbers of updates together. Removing this will make it easier to centralise the unplug code, and will clear the other for other unplug enhancements which have a measurable effect. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-07-31 09:08:14 +02:00
Jonathan Brassow	cc4d1efdd0	MD RAID10: Export md_raid10_congested md/raid10: Export is_congested test. In similar fashion to commits `11d8a6e371` `1ed7242e59` we export the RAID10 congestion checking function so that dm-raid.c can make use of it and make use of the personality. The 'queue' and 'gendisk' structures will not be available to the MD code when device-mapper sets up the device, so we conditionalize access to these fields also. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-31 10:03:53 +10:00
Jonathan Brassow	473e87ce48	MD: Move macros from raid1.h to raid1.c MD RAID1/RAID10: Move some macros from .h file to .c file There are three macros (IO_BLOCKED,IO_MADE_GOOD,BIO_SPECIAL) which are defined in both raid1.h and raid10.h. They are only used in there respective .c files. However, if we wish to make RAID10 accessible to the device-mapper RAID target (dm-raid.c), then we need to move these macros into the .c files where they are used so that they do not conflict with each other. The macros from the two files are identical and could be moved into md.h, but I chose to leave the duplication and have them remain in the personality files. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-31 10:03:52 +10:00
Jonathan Brassow	dc280d987f	MD RAID10: rename mirror_info structure MD RAID10: Rename the structure 'mirror_info' to 'raid10_info' The same structure name ('mirror_info') is used by raid1. Each of these structures are defined in there respective header files. If dm-raid is to support both RAID1 and RAID10, the header files will be included and the structure names must not collide. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-31 10:03:52 +10:00
Jonathan Brassow	3bbae04b12	MD RAID10: Fix compiler warning. MD RAID10: Fix compiler warning. Initialize variable to prevent compiler warning. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-31 10:03:52 +10:00
NeilBrown	10684112c9	md/raid10: fix careless build error build error introduced by commit `b357f04a67` That function doesn't get extra args until a later patch. Bother. Reported-by: Fengguang Wu <wfg@linux.intel.com> Reported-by: Simon Kirby <sim@hostway.ca> Reported-by: Tobias Klausmann <tobias.johannes.klausmann@mni.thm.de> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-04 09:35:35 +10:00
NeilBrown	b357f04a67	md: fix up plugging (again). The value returned by "mddev_check_plug" is only valid until the next 'schedule' as that will unplug things. This could happen at any call to mempool_alloc. So just calling mddev_check_plug at the start doesn't really make sense. So call it just before, or just after, queuing things for the thread. As the action that happens at unplug is to wake the thread, this makes lots of sense. If we cannot add a plug (which requires a small GFP_ATOMIC alloc) we wake thread immediately. RAID5 is a bit different. Requests are queued for the thread and the thread is woken by release_stripe. So we don't need to wake the thread on failure. However the thread doesn't perform certain actions when there is any active plug, so it is important to install a plug before waking the thread. So for RAID5 we install the plug before queuing the request and waking the thread. Without this patch it is possible for raid1 or raid10 to queue a request without then waking the thread, resulting in the array locking up. Also change raid10 to only flush_pending_write when there are not active plugs, just like raid1. This patch is suitable for 3.0 or later. I plan to submit it to -stable, but I'll like to let it spend a few weeks in mainline first to be sure it is completely safe. Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 17:45:31 +10:00
NeilBrown	0232605d98	md: make 'name' arg to md_register_thread non-optional. Having the 'name' arg optional and defaulting to the current personality name is no necessary and leads to errors, as when changing the level of an array we can end up using the name of the old level instead of the new one. So make it non-optional and always explicitly pass the name of the level that the array will be. Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 15:56:52 +10:00
NeilBrown	055d3747db	md/raid10: fix failure when trying to repair a read error. commit `58c54fcca3` md/raid10: handle further errors during fix_read_error better. in 3.1 added "r10_sync_page_io" which takes an IO size in sectors. But we were passing the IO size in bytes!!! This resulting in bio_add_page failing, and empty request being sent down, and a consequent BUG_ON in scsi_lib. [fix missing space in error message at same time] This fix is suitable for 3.1.y and later. Cc: stable@vger.kernel.org Reported-by: Christian Balzer <chibi@gol.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 15:55:33 +10:00
NeilBrown	fc448a18ae	md/raid10: Don't try to recovery unmatched (and unused) chunks. If a RAID10 has an odd number of chunks - as might happen when there are an odd number of devices - the last chunk has no pair and so is not mirrored. We don't store data there, but when recovering the last device in an array we retry to recover that last chunk from a non-existent location. This results in an error, and the recovery aborts. When we get to that last chunk we should just stop - there is nothing more to do anyway. This bug has been present since the introduction of RAID10, so the patch is appropriate for any -stable kernel. Cc: stable@vger.kernel.org Reported-by: Christian Balzer <chibi@gol.com> Tested-by: Christian Balzer <chibi@gol.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 10:37:30 +10:00
NeilBrown	aba336bd1d	md: raid1/raid10: fix problem with merge_bvec_fn The new merge_bvec_fn which calls the corresponding function in subsidiary devices requires that mddev->merge_check_needed be set if any child has a merge_bvec_fn. However were were only setting that when a device was hot-added, not when a device was present from the start. This bug was introduced in 3.4 so patch is suitable for 3.4.y kernels. However that are conflicts in raid10.c so a separate patch will be needed for 3.4.y. Cc: stable@vger.kernel.org Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-31 15:56:30 +10:00
NeilBrown	63aced6102	md/raid10: Remove extras after reshape to smaller number of devices. When a reshape which reduced the number of devices finishes we must remove the extra devices. So ensure that raid10_remove_disk won't try to keep them, and have raid10_finish_reshape clear the 'in_sync' flag. Then remove_and_add_spares will be able to remove them. Reported-by: Hannes Reinecke <hare@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-22 13:55:33 +10:00
NeilBrown	bb63a7019d	md/raid10: resize bitmap when required during reshape. If a reshape changes the size of the array, then we can now update the bitmap to suit - so do so. Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-22 13:55:28 +10:00
NeilBrown	a4a6125a07	md: allow array to be resized while bitmap is present. Now that bitmaps can be resized, we can allow an array to be resized while the bitmap is present. This only covers resizing that involves changing the effective size of member devices, not resizing that changes the number of devices. Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-22 13:55:27 +10:00
majianpeng	5fdd2cf826	md/raid10: Fix memleak in r10buf_pool_alloc If the allocation of rep1_bio fails, we currently don't free the 'bio' of the same dev. Reported by kmemleak. Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-22 13:55:03 +10:00
NeilBrown	3ea7daa5d7	md/raid10: add reshape support A 'near' or 'offset' lay RAID10 array can be reshaped to a different 'near' or 'offset' layout, a different chunk size, and a different number of devices. However the number of copies cannot change. Unlike RAID5/6, we do not support having user-space backup data that is being relocated during a 'critical section'. Rather, the data_offset of each device must change so that when writing any block to a new location, it will not over-write any data that is still 'live'. This means that RAID10 reshape is not supportable on v0.90 metadata. The different between the old data_offset and the new_offset must be at least the larger of the chunksize multiplied by offset copies of each of the old and new layout. (for 'near' mode, offset_copies == 1). A larger difference of around 64M seems useful for in-place reshapes as more data can be moved between metadata updates. Very large differences (e.g. 512M) seem to slow the process down due to lots of long seeks (on oldish consumer graded devices at least). Metadata needs to be updated whenever the place we are about to write to is considered - by the current metadata - to still contain data in the old layout. [unbalanced locking fix from Dan Carpenter <dan.carpenter@oracle.com>] Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-22 13:53:47 +10:00
NeilBrown	deb200d085	md/raid10: split out interpretation of layout to separate function. We will soon be interpreting the layout (and chunksize etc) from multiple places to support reshape. So split it out into separate function. Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-21 09:28:33 +10:00
NeilBrown	f8c9e74ff0	md/raid10: Introduce 'prev' geometry to support reshape. When RAID10 supports reshape it will need a 'previous' and a 'current' geometry, so introduce that here. Use the 'prev' geometry when before the reshape_position, and the current 'geo' when beyond it. At other times, use both as appropriate. For now, both are identical (And reshape_position is never set). When we use the 'prev' geometry, we must use the old data_offset. When we use the current (And a reshape is happening) we must use the new_data_offset. Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-21 09:28:33 +10:00
NeilBrown	5cf00fcd3c	md/raid10: collect some geometry fields into a dedicated structure. We will shortly be adding reshape support for RAID10 which will require it having 2 concurrent geometries (before and after). To make that easier, collect most geometry fields into 'struct geom' and access them from there. Then we will more easily be able to add a second set of fields. Note that 'copies' is not in this struct and so cannot be changed. There is little need to change this number and doing so is a lot more difficult as it requires reallocating more things. So leave it out for now. Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-21 09:28:20 +10:00
NeilBrown	c6563a8c38	md: add possibility to change data-offset for devices. When reshaping we can avoid costly intermediate backup by changing the 'start' address of the array on the device (if there is enough room). So as a first step, allow such a change to be requested through sysfs, and recorded in v1.x metadata. (As we didn't previous check that all 'pad' fields were zero, we need a new FEATURE flag for this. A (belatedly) check that all remaining 'pad' fields are zero to avoid a repeat of this) The new data offset must be requested separately for each device. This allows each to have a different change in the data offset. This is not likely to be used often but as data_offset can be set per-device, new_data_offset should be too. This patch also removes the 'acknowledged' arg to rdev_set_badblocks as it is never used and never will be. At the same time we add a new arg ('in_new') which is currently always zero but will be used more soon. When a reshape finishes we will need to update the data_offset and rdev->sectors. So provide an exported function to do that. Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-21 09:27:00 +10:00
NeilBrown	b0d634d568	md/raid10: fix transcription error in calc_sectors conversion. The old code was sector_div(stride, fc); the new code was sector_dir(size, conf->near_copies); 'size' is right (the stride various wasn't really needed), but 'fc' means 'far_copies', and that is an important difference. Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-19 09:01:13 +10:00
NeilBrown	6508fdbf40	md/raid10: set dev_sectors properly when resizing devices in array. raid10 stores dev_sectors in 'conf' separately from the one in 'mddev' because it can have a very significant effect on block addressing and so need to be updated carefully. However raid10_resize isn't updating it at all! To update it correctly, we need to make sure it is a proper multiple of the chunksize taking various details of the layout in to account. This calculation is currently done in setup_conf. So split it out from there and call it from raid10_resize as well. Then set conf->dev_sectors properly. Signed-off-by: NeilBrown <neilb@suse.de>	2012-05-17 10:08:45 +10:00
majianpeng	f4380a9158	md/raid1,raid10: Fix calculation of 'vcnt' when processing error recovery. If r1bio->sectors % 8 != 0,then the memcmp and a later memcpy will omit the last bio_vec. This is suitable for any stable kernel since 3.1 when bad-block management was introduced. Cc: stable@vger.kernel.org Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-04-12 16:04:47 +10:00
NeilBrown	5020ad7d14	md/raid1,raid10: don't compare excess byte during consistency check. When comparing two pages read from different legs of a mirror, only compare the bytes that were read, not the whole page. In most cases we read a whole page, but in some cases with bad blocks or odd sizes devices we might read fewer than that. This bug has been present "forever" but at worst it might cause a report of two many mismatches and generate a little bit extra resync IO, so there is no need to back-port to -stable kernels. Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-04-03 15:39:23 +10:00
NeilBrown	006a09a0ae	md/raid10 - support resizing some RAID10 arrays. 'resizing' an array in this context means making use of extra space that has become available in component devices, not adding new devices. It also includes shrinking the array to take up less space of component devices. This is not supported for array with a 'far' layout. However for 'near' and 'offset' layout arrays, adding and removing space at the end of the devices is easy to support, and this patch provides that support. Signed-off-by: NeilBrown <neilb@suse.de>	2012-03-19 12:46:40 +11:00
NeilBrown	050b66152f	md/raid10: handle merge_bvec_fn in member devices. Currently we don't honour merge_bvec_fn in member devices so if there is one, we force all requests to be single-page at most. This is not ideal. So enhance the raid10 merge_bvec_fn to check that function in children as well. This introduces a small problem. There is no locking around calls the ->merge_bvec_fn and subsequent calls to ->make_request. So a device added between these could end up getting a request which violates its merge_bvec_fn. Currently the best we can do is synchronize_sched(). This will work providing no preemption happens. If there is preemption, we just have to hope that new devices are largely consistent with old devices. Signed-off-by: NeilBrown <neilb@suse.de>	2012-03-19 12:46:39 +11:00
NeilBrown	dafb20fa34	md: tidy up rdev_for_each usage. md.h has an 'rdev_for_each()' macro for iterating the rdevs in an mddev. However it uses the 'safe' version of list_for_each_entry, and so requires the extra variable, but doesn't include 'safe' in the name, which is useful documentation. Consequently some places use this safe version without needing it, and many use an explicity list_for_each entry. So: - rename rdev_for_each to rdev_for_each_safe - create a new rdev_for_each which uses the plain list_for_each_entry, - use the 'safe' version only where needed, and convert all other list_for_each_entry calls to use rdev_for_each. Signed-off-by: NeilBrown <neilb@suse.de>	2012-03-19 12:46:39 +11:00
NeilBrown	d6b42dcb99	md/raid1,raid10: avoid deadlock during resync/recovery. If RAID1 or RAID10 is used under LVM or some other stacking block device, it is possible to enter a deadlock during resync or recovery. This can happen if the upper level block device creates two requests to the RAID1 or RAID10. The first request gets processed, blocks recovery and queue requests for underlying requests in current->bio_list. A resync request then starts which will wait for those requests and block new IO. But then the second request to the RAID1/10 will be attempted and it cannot progress until the resync request completes, which cannot progress until the underlying device requests complete, which are on a queue behind that second request. So allow that second request to proceed even though there is a resync request about to start. This is suitable for any -stable kernel. Cc: stable@vger.kernel.org Reported-by: Ray Morris <support@bettercgi.com> Tested-by: Ray Morris <support@bettercgi.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-03-19 12:46:38 +11:00
NeilBrown	dc10c643e8	md: allow re-add to failed arrays. When an array is failed (some data inaccessible) then there is no point attempting to add a spare as it could not possibly be recovered. However that may be value in re-adding a recently removed device. e.g. if there is a write-intent-bitmap and it is clear, then access to the data could be restored by this action. So don't reject a re-add to a failed array for RAID10 and RAID5 (the only arrays types that check for a failed array). Signed-off-by: NeilBrown <neilb@suse.de>	2012-03-19 12:46:37 +11:00
NeilBrown	547414d19f	md/raid10: remove unnecessary smp_mb() from end_sync_write Recent commit `4ca40c2ce0` (md/raid10: Allow replacement device ...) added an smp_mb in end_sync_write. This was to close a possible race with raid10_remove_disk. However there is no such race as it is never attempted to remove a disk while resync (or recovery) is happening. so the smp_mb is just noise. Signed-off-by: NeilBrown <neilb@suse.de>	2012-03-13 11:21:20 +11:00
NeilBrown	7a90484825	md/raid10: fix assembling of arrays with replacement devices. commit `56a2559bb6` (md/raid10: recognise replacements ...) changed 'run' to set ->replacement or ->rdev depending on the 'Replacement' status if the device, but it didn't remove the old unconditional setting of 'rdev'. So it was largely ineffective. So remove that now. Signed-off-by: NeilBrown <neilb@suse.de>	2012-03-06 10:12:45 +11:00
NeilBrown	fae8cc5ed0	md/raid10: fix handling of error on last working device in array. If we get a read error on the last working device in a RAID10 which contains the target block, then we don't fail the device (which is good) but we don't abort retries, which is wrong. We end up in an infinite loop retrying the read on the one device. This patch fixes the problem in two places: 1/ in raid10_end_read_request we don't even ask for a retry if this was the last usable device. This is efficient but a little racy and will sometimes retry when it should not. 2/ in handle_read_error we are careful to exclude any device from retry which we tried to mark as faulty (that might have failed if it was the last device). This is race-free but less efficient. Signed-off-by: NeilBrown <neilb@suse.de>	2012-02-14 11:10:10 +11:00
NeilBrown	b7044d41b5	md/raid10: If there is a spare and a want_replacement device, start replacement. When attempting to add a spare to a RAID10 array, also consider adding it as a replacement for a want_replacement device. Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:56 +11:00
NeilBrown	56a2559bb6	md/raid10: recognise replacements when assembling array. If a Replacement is seen, file it as such. If we see two replacements (or two normal devices) for the one slot, abort. Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:55 +11:00
NeilBrown	4ca40c2ce0	md/raid10: Allow replacement device to be replace old drive. When recovery finish and spare_active is called, check for a replace that might have just become fully synced and mark it as such, marking the original as failed. Then when the original is removed, move the replacement into its position. This means that 'replacement' and spontaneously become NULL in some situations. Make sure we check for those. It also means that 'rdev' and 'replacement' could appear to be identical - check for that too. Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:55 +11:00
NeilBrown	24afd80d99	md/raid10: handle recovery of replacement devices. If there is a replacement device, then recover to it, reading from any drives - maybe the one being replaced, maybe not. Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:55 +11:00
NeilBrown	9ad1aefc8a	md/raid10: Handle replacement devices during resync. If we need to resync an array which has replacement devices, we always write any block checked to every replacement. If the resync was bitmap-based resync we will then complete the replacement normally. If it was a full resync, we mark the replacements as fully recovered when the resync finishes so no further recovery is needed. Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:55 +11:00
NeilBrown	475b0321a4	md/raid10: writes should get directed to replacement as well as original. When writing, we need to submit two writes, one to the original, and one to the replacements - if there is a replacement. If the write to the replacement results in a write error we just fail the device. We only try to record write errors to the original. This only handles writing new data. Writing for resync/recovery will come later. Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:55 +11:00
NeilBrown	c8ab903ea9	md/raid10: allow removal of failed replacement devices. Enhance raid10_remove_disk to be able to remove ->replacement as well as ->rdev Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:54 +11:00
NeilBrown	abbf098e6e	md/raid10: preferentially read from replacement device if possible. When reading (for array reads, not for recovery etc) we read from the replacement device if it has recovered far enough. This requires storing the chosen rdev in the 'r10_bio' so we can make sure to drop the ref on the right device when the read finishes. Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:54 +11:00
NeilBrown	96c3fd1f38	md/raid10: change read_balance to return an rdev It makes more sense to return an rdev than just an index as read_balance() gets a reference to the rdev and so returning the pointer make this more idiomatic. This will be needed in a future patch when we might return a 'replacement' rdev instead of the main rdev. Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:54 +11:00
NeilBrown	69335ef3bc	md/raid10: prepare data structures for handling replacement. Allow each slot in the RAID10 to have 2 devices, the want_replacement and the replacement. Also an r10bio to have 2 bios, and for resync/recovery allocate the second bio if there are any replacement devices. Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:54 +11:00
NeilBrown	b8321b68d1	md: change hot_remove_disk to take an rdev rather than a number. Soon an array will be able to have multiple devices with the same raid_disk number (an original and a replacement). So removing a device based on the number won't work. So pass the actual device handle instead. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:51 +11:00
Linus Torvalds	32aaeffbd4	Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits) Revert "tracing: Include module.h in define_trace.h" irq: don't put module.h into irq.h for tracking irqgen modules. bluetooth: macroize two small inlines to avoid module.h ip_vs.h: fix implicit use of module_get/module_put from module.h nf_conntrack.h: fix up fallout from implicit moduleparam.h presence include: replace linux/module.h with "struct module" wherever possible include: convert various register fcns to macros to avoid include chaining crypto.h: remove unused crypto_tfm_alg_modname() inline uwb.h: fix implicit use of asm/page.h for PAGE_SIZE pm_runtime.h: explicitly requires notifier.h linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h miscdevice.h: fix up implicit use of lists and types stop_machine.h: fix implicit use of smp.h for smp_processor_id of: fix implicit use of errno.h in include/linux/of.h of_platform.h: delete needless include <linux/module.h> acpi: remove module.h include from platform/aclinux.h miscdevice.h: delete unnecessary inclusion of module.h device_cgroup.h: delete needless include <linux/module.h> net: sch_generic remove redundant use of <linux/module.h> net: inet_timewait_sock doesnt need <linux/module.h> ... Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in - drivers/media/dvb/frontends/dibx000_common.c - drivers/media/video/{mt9m111.c,ov6650.c} - drivers/mfd/ab3550-core.c - include/linux/dmaengine.h	2011-11-06 19:44:47 -08:00
Linus Torvalds	b4fdcb02f1	Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-block * 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits) block: don't call blk_drain_queue() if elevator is not up blk-throttle: use queue_is_locked() instead of lockdep_is_held() blk-throttle: Take blkcg->lock while traversing blkcg->policy_list blk-throttle: Free up policy node associated with deleted rule block: warn if tag is greater than real_max_depth. block: make gendisk hold a reference to its queue blk-flush: move the queue kick into blk-flush: fix invalid BUG_ON in blk_insert_flush block: Remove the control of complete cpu from bio. block: fix a typo in the blk-cgroup.h file block: initialize the bounce pool if high memory may be added later block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown block: drop @tsk from attempt_plug_merge() and explain sync rules block: make get_request[_wait]() fail if queue is dead block: reorganize throtl_get_tg() and blk_throtl_bio() block: reorganize queue draining block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg() block: pass around REQ_* flags instead of broken down booleans during request alloc/free block: move blk_throtl prototypes to block/blk.h block: fix genhd refcounting in blkio_policy_parse_and_set() ... Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion and making the request functions be of type "void" instead of "int" in - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c} - drivers/staging/zram/zram_drv.c	2011-11-04 17:06:58 -07:00
Paul Gortmaker	056075c764	md: Add module.h to all files using it implicitly A pending cleanup will mean that module.h won't be implicitly everywhere anymore. Make sure the modular drivers in md dir are actually calling out for <module.h> explicitly in advance. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2011-10-31 19:31:18 -04:00
NeilBrown	7fcc7c8acf	md/raid10: Fix bug when activating a hot-spare. This is a fairly serious bug in RAID10. When a RAID10 array is degraded and a hot-spare is activated, the spare does not take up the empty slot, but rather replaces the first working device. This is likely to make the array non-functional. It would normally be possible to recover the data, but that would need care and is not guaranteed. This bug was introduced in commit `2bb77736ae` which first appeared in 3.1. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2011-10-31 12:59:44 +11:00
NeilBrown	d890fa2b05	md: Fix some bugs in recovery_disabled handling. In 3.0 we changed the way recovery_disabled was handle so that instead of testing against zero, we test an mddev-> value against a conf-> value. Two problems: 1/ one place in raid1 was missed and still sets to '1'. 2/ We didn't explicitly set the conf-> value at array creation time. It defaulted to '0' just like the mddev value does so they could appear equal and thus disable recovery. This did not affect normal 'md' as it calls bind_rdev_to_array which changes the mddev value. However the dmraid interface doesn't call this and so doesn't change ->recovery_disabled; so at array start all recovery is incorrectly disabled. So initialise the 'conf' value to one less that the mddev value, so the will only be the same when explicitly set that way. Reported-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-10-26 11:54:39 +11:00
Jens Axboe	5c04b426f2	Merge branch 'v3.1-rc10' into for-3.2/core Conflicts: block/blk-core.c include/linux/blkdev.h Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-19 14:30:42 +02:00
NeilBrown	34db0cd60f	md: add proper write-congestion reporting to RAID1 and RAID10. RAID1 and RAID10 handle write requests by queuing them for handling by a separate thread. This is because when a write-intent-bitmap is active we might need to update the bitmap first, so it is good to queue a lot of writes, then do one big bitmap update for them all. However writeback request devices to appear to be congested after a while so it can make some guesstimate of throughput. The infinite queue defeats that (note that RAID5 has already has a finite queue so it doesn't suffer from this problem). So impose a limit on the number of pending write requests. By default it is 1024 which seems to be generally suitable. Make it configurable via module option just in case someone finds a regression. Signed-off-by: NeilBrown <neilb@suse.de>	2011-10-11 16:50:01 +11:00
NeilBrown	84fc4b56db	md: rename "mdk_personality" to "md_personality" "mdk" doesn't mean anything any more. Signed-off-by: NeilBrown <neilb@suse.de>	2011-10-11 16:49:58 +11:00
NeilBrown	e879a8793f	md/raid10: typedef removal: conf_t -> struct r10conf Signed-off-by: NeilBrown <neilb@suse.de>	2011-10-11 16:49:02 +11:00
NeilBrown	e373ab1091	md/raid0: typedef removal: raid0_conf_t -> struct r0conf Signed-off-by: NeilBrown <neilb@suse.de>	2011-10-11 16:48:59 +11:00
NeilBrown	0f6d02d580	md: remove typedefs: mirror_info_t -> struct mirror_info Signed-off-by: NeilBrown <neilb@suse.de>	2011-10-11 16:48:46 +11:00
NeilBrown	9f2c9d12bc	md: remove typedefs: r10bio_t -> struct r10bio and r1bio_t -> struct r1bio Signed-off-by: NeilBrown <neilb@suse.de>	2011-10-11 16:48:43 +11:00
NeilBrown	fd01b88c75	md: remove typedefs: mddev_t -> struct mddev Having mddev_t and 'struct mddev_s' is ugly and not preferred Signed-off-by: NeilBrown <neilb@suse.de>	2011-10-11 16:47:53 +11:00
NeilBrown	3cb0300200	md: removing typedefs: mdk_rdev_t -> struct md_rdev The typedefs are just annoying. 'mdk' probably refers to 'md_k.h' which used to be an include file that defined this thing. Signed-off-by: NeilBrown <neilb@suse.de>	2011-10-11 16:45:26 +11:00
NeilBrown	01f96c0a99	md: Avoid waking up a thread after it has been freed. Two related problems: 1/ some error paths call "md_unregister_thread(mddev->thread)" without subsequently clearing ->thread. A subsequent call to mddev_unlock will try to wake the thread, and crash. 2/ Most calls to md_wakeup_thread are protected against the thread disappeared either by: - holding the ->mutex - having an active request, so something else must be keeping the array active. However mddev_unlock calls md_wakeup_thread after dropping the mutex and without any certainty of an active request, so the ->thread could theoretically disappear. So we need a spinlock to provide some protections. So change md_unregister_thread to take a pointer to the thread pointer, and ensure that it always does the required locking, and clears the pointer properly. Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com> Signed-off-by: NeilBrown <neilb@suse.de> cc: stable@kernel.org	2011-09-21 15:30:20 +10:00
Christoph Hellwig	5a7bbad27a	block: remove support for bio remapping from ->make_request There is very little benefit in allowing to let a ->make_request instance update the bios device and sector and loop around it in __generic_make_request when we can archive the same through calling generic_make_request from the driver and letting the loop in generic_make_request handle it. Note that various drivers got the return value from ->make_request and returned non-zero values for errors. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-09-12 12:12:01 +02:00
NeilBrown	079fa166a2	md/raid1,10: Remove use-after-free bug in make_request. A single request to RAID1 or RAID10 might result in multiple requests if there are known bad blocks that need to be avoided. To detect if we need to submit another write request we test: if (sectors_handled < (bio->bi_size >> 9)) { However this is after we call _write_done() so the 'bio' no longer belongs to us - the writes could have completed and the bio freed. So move the _write_done call until after the test against bio->bi_size. This addresses https://bugzilla.kernel.org/show_bug.cgi?id=41862 Reported-by: Bruno Wolff III <bruno@wolff.to> Tested-by: Bruno Wolff III <bruno@wolff.to> Signed-off-by: NeilBrown <neilb@suse.de>	2011-09-10 17:21:23 +10:00
NeilBrown	19d5f834d6	md/raid10: unify handling of write completion. A write can complete at two different places: 1/ when the last member-device write completes, through raid10_end_write_request 2/ in make_request() when we remove the initial bias from ->remaining. These two should do exactly the same thing and the comment says they do, but they don't. So factor the correct code out into a function and call it in both places. This makes the code much more similar to RAID1. The difference is only significant if there is an error, and they usually take a while, so it is unlikely that there will be an error already when make_request is completing, so this is unlikely to cause real problems. Signed-off-by: NeilBrown <neilb@suse.de>	2011-09-10 17:21:17 +10:00
NeilBrown	58c54fcca3	md/raid10: handle further errors during fix_read_error better. If we find more read/write errors we should record a bad block before failing the device. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:25 +10:00
NeilBrown	5e5702898e	md/raid10: Handle read errors during recovery better. Currently when we get a read error during recovery, we simply abort the recovery. Instead, repeat the read in page-sized blocks. On successful reads, write to the target. On read errors, record a bad block on the destination, and only if that fails do we abort the recovery. As we now retry reads we need to know where we read from. This was in bi_sector but that can be changed during a read attempt. So store the correct from_addr and to_addr in the r10_bio for later access. Signed-off-by: NeilBrown<neilb@suse.de>	2011-07-28 11:39:25 +10:00
NeilBrown	e684e41db3	md/raid10: simplify read error handling during recovery. If a read error is detected during recovery the code currently fails the read device. This isn't really necessary. recovery_request_write will signal a write error to end_sync_write and it will record a write error on the destination device which will record a bad block there or kick it from the array. So just remove this call to do md_error. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:25 +10:00
NeilBrown	1a0b7cd826	md/raid10: record bad blocks due to write errors during resync/recovery. If we get a write error during resync/recovery don't fail the device but instead record a bad block. If that fails we can then fail the device. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:25 +10:00
NeilBrown	f84ee364dd	md/raid10: attempt to fix read errors during resync/check We already attempt to fix read errors found during normal IO and a 'repair' process. It is best to try to repair them at any time they are found, so move a test so that during sync and check a read error will be corrected by over-writing with good data. If both (all) devices have known bad blocks in the sync section we won't try to fix even though the bad blocks might not overlap. That should be considered later. Also if we hit a read error during recovery we don't try to fix it. It would only be possible to fix if there were at least three copies of data, which is not very common with RAID10. But it should still be considered later. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:25 +10:00
NeilBrown	bd870a16c5	md/raid10: Handle write errors by updating badblock log. When we get a write error (in the data area, not in metadata), update the badblock log rather than failing the whole device. As the write may well be many blocks, we trying writing each block individually and only log the ones which fail. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:24 +10:00
NeilBrown	749c55e942	md/raid10: clear bad-block record when write succeeds. If we succeed in writing to a block that was recorded as being bad, we clear the bad-block record. This requires some delayed handling as the bad-block-list update has to happen in process-context. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:24 +10:00
NeilBrown	d4432c23be	md/raid10: avoid writing to known bad blocks on known bad drives. Writing to known bad blocks on drives that have seen a write error is asking for trouble. So try to avoid these blocks. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:24 +10:00
NeilBrown	e875ecea26	md/raid10 record bad blocks as needed during recovery. When recovering one or more devices, if all the good devices have bad blocks we should record a bad block on the device being rebuilt. If this fails, we need to abort the recovery. To ensure we don't think that we aborted later than we actually did, we need to move the check for MD_RECOVERY_INTR earlier in md_do_sync, in particular before mddev->curr_resync is updated. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:24 +10:00
NeilBrown	40c356ce5a	md/raid10: avoid reading known bad blocks during resync/recovery. During resync/recovery limit the size of the request to avoid reading into a bad block that does not start at-or-before the current read address. Similarly if there is a bad block at this address, don't allow the current request to extend beyond the end of that bad block. Now that we don't ever read from known bad blocks, it is safe to allow devices with those blocks into the array. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:24 +10:00
NeilBrown	8dbed5cebd	md/raid10 - avoid reading from known bad blocks - part 3 When attempting to repair a read error, don't read from devices with a known bad block. As we are only reading PAGE_SIZE blocks, we don't try to narrow down to smaller regions in the hope that only part of this page is bad - it isn't worth the effort. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:24 +10:00
NeilBrown	7399c31bc9	md/raid10: avoid reading from known bad blocks - part 2 When redirecting a read error to a different device, we must again avoid bad blocks and possibly split the request. Spin_lock typo fixed thanks to Dan Carpenter <error27@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:23 +10:00
NeilBrown	856e08e237	md/raid10: avoid reading from known bad blocks - part 1 This patch just covers the basic read path: 1/ read_balance needs to check for badblocks, and return not only the chosen slot, but also how many good blocks are available there. 2/ read submission must be ready to issue multiple reads to different devices as different bad blocks on different devices could mean that a single large read cannot be served by any one device, but can still be served by the array. This requires keeping count of the number of outstanding requests per bio. This count is stored in 'bi_phys_segments' On read error we currently just fail the request if another target cannot handle the whole request. Next patch refines that a bit. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:23 +10:00
NeilBrown	560f8e5532	md/raid10: Split handle_read_error out from raid10d. raid10d() is too big and is about to get bigger, so split handle_read_error() out as a separate function. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:23 +10:00
NeilBrown	1294b9c973	md/raid10: simplify/reindent some loops. When a loop ends with a large if, it can be neater to change the if to invert the condition and just 'continue'. Then the body of the if can be indented to a lower level. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:23 +10:00
NeilBrown	de393cdea6	md: make it easier to wait for bad blocks to be acknowledged. It is only safe to choose not to write to a bad block if that bad block is safely recorded in metadata - i.e. if it has been 'acknowledged'. If it hasn't we need to wait for the acknowledgement. We support that using rdev->blocked wait and md_wait_for_blocked_rdev by introducing a new device flag 'BlockedBadBlock'. This flag is only advisory. It is cleared whenever we acknowledge a bad block, so that a waiter can re-check the particular bad blocks that it is interested it. It should be set by a caller when they find they need to wait. This (set after test) is inherently racy, but as md_wait_for_blocked_rdev already has a timeout, losing the race will have minimal impact. When we clear "Blocked" was also clear "BlockedBadBlocks" incase it was set incorrectly (see above race). We also modify the way we manage 'Blocked' to fit better with the new handling of 'BlockedBadBlocks' and to make it consistent between externally managed and internally managed metadata. This requires that each raidXd loop checks if the metadata needs to be written and triggers a write (md_check_recovery) if needed. Otherwise a queued write request might cause raidXd to wait for the metadata to write, and only that thread can write it. Before writing metadata, we set FaultRecorded for all devices that are Faulty, then after writing the metadata we clear Blocked for any device for which the Fault was certainly Recorded. The 'faulty' device flag now appears in sysfs if the device is faulty or it has unacknowledged bad blocks. So user-space which does not understand bad blocks can continue to function correctly. User space which does, should not assume a device is faulty until it sees the 'faulty' flag, and then sees the list of unacknowledged bad blocks is empty. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:31:48 +10:00
NeilBrown	34b343cff4	md: don't allow arrays to contain devices with bad blocks. As no personality understand bad block lists yet, we must reject any device that is known to contain bad blocks. As the personalities get taught, these tests can be removed. This only applies to raid1/raid5/raid10. For linear/raid0/multipath/faulty the whole concept of bad blocks doesn't mean anything so there is no point adding the checks. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-28 11:31:47 +10:00
Namhyung Kim	cbea21703b	md/raid10: move rdev->corrected_errors counting Read errors are considered to corrected if write-back and re-read cycle is finished without further problems. Thus moving the rdev-> corrected_errors counting after the re-reading looks more reasonable IMHO. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
NeilBrown	700c721389	md/raid10: Improve decision on whether to fail a device with a read error. Normally we would fail a device with a READ error. However if doing so causes the array to fail, it is better to leave the device in place and just return the read error to the caller. The current test for decide if the array will fail is overly simplistic. We have a function 'enough' which can tell if the array is failed or not, so use it to guide the decision. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
NeilBrown	2bb77736ae	md/raid10: Make use of new recovery_disabled handling When we get a read error during recovery, RAID10 previously arranged for the recovering device to appear to fail so that the recovery stops and doesn't restart. This is misleading and wrong. Instead, make use of the new recovery_disabled handling and mark the target device and having recovery disabled. Add appropriate checks in add_disk and remove_disk so that devices are removed and not re-added when recovery is disabled. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
Christian Dietrich	8bda470e8e	md/raid: use printk_ratelimited instead of printk_ratelimit As per printk_ratelimit comment, it should not be used. Signed-off-by: Christian Dietrich <christian.dietrich@informatik.uni-erlangen.de> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
Namhyung Kim	c65060ad42	md/raid10: share pages between read and write bio's during recovery When performing a recovery, only first 2 slots in r10_bio are in use, for read and write respectively. However all of pages in the write bio are never used and just replaced to read bio's when the read completes. Get rid of those unused pages and share read pages properly. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-18 17:38:49 +10:00
Namhyung Kim	778ca01852	md/raid10: factor out common bio handling code When normal-write and sync-read/write bio completes, we should find out the disk number the bio belongs to. Factor those common code out to a separate function. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-18 17:38:47 +10:00
Namhyung Kim	2c4193df37	md/raid10: get rid of duplicated conditional expression Variable 'first' is initialized to zero and updated to @rdev->raid_disk only if it is greater than 0. Thus condition '>= first' always implies '>= 0' so the latter is not needed. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-18 17:38:43 +10:00
NeilBrown	ab9d47e990	md/raid10: reformat some loops with less indenting. When a loop ends with an 'if' with a large body, it is neater to make the if 'continue' on the inverse condition, and then the body is indented less. Apply this pattern 3 times, and wrap some other long lines. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:54:41 +10:00
NeilBrown	f17ed07c85	md/raid10: remove unused variable. This variable 'disk' is never used - how odd. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:54:32 +10:00
NeilBrown	a8830bcaf3	md/raid10: make more use of 'slot' in raid10d. Now that we have a 'slot' variable, make better use of it to simplify some code a little. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:54:19 +10:00
NeilBrown	7c4e06ff2b	md/raid10: some tidying up in fix_read_error Currently the rdev on which a read error happened could be removed before we perform the fix_error handling. This requires extra tests for NULL. So delay the rdev_dec_pending call until after the call to fix_read_error so that we can be sure that the rdev still exists. This allows an 'if' clause to be removed so the body gets re-indented back one level. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:53:17 +10:00
NeilBrown	56d9912106	md: simplify raid10 read_balance raid10 read balance has two different loop for looking through possible devices to chose the best. Collapse those into one loop and generally make the code more readable. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:27:03 +10:00
NeilBrown	c3b328ac84	md: fix up raid1/raid10 unplugging. We just need to make sure that an unplug event wakes up the md thread, which is exactly what mddev_check_plugged does. Also remove some plug-related code that is no longer needed. Signed-off-by: NeilBrown <neilb@suse.de>	2011-04-18 18:25:43 +10:00
NeilBrown	e1dfa0a297	md: use new plugging interface for RAID IO. md/raid submits a lot of IO from the various raid threads. So adding start/finish plug calls to those so that some plugging happens. Signed-off-by: NeilBrown <neilb@suse.de>	2011-04-18 18:25:41 +10:00
Lucas De Marchi	25985edced	Fix common misspellings Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>	2011-03-31 11:26:23 -03:00
Martin K. Petersen	a91a2785b2	block: Require subsystems to explicitly allocate bio_set integrity mempool MD and DM create a new bio_set for every metadevice. Each bio_set has an integrity mempool attached regardless of whether the metadevice is capable of passing integrity metadata. This is a waste of memory. Instead we defer the allocation decision to MD and DM since we know at metadevice creation time whether integrity passthrough is needed or not. Automatic integrity mempool allocation can then be removed from bioset_create() and we make an explicit integrity allocation for the fs_bio_set. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Acked-by: Mike Snitzer <snizer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-17 11:11:05 +01:00
Jens Axboe	4c63f5646e	Merge branch 'for-2.6.39/stack-plug' into for-2.6.39/core Conflicts: block/blk-core.c block/blk-flush.c drivers/md/raid1.c drivers/md/raid10.c drivers/md/raid5.c fs/nilfs2/btnode.c fs/nilfs2/mdt.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-10 08:58:35 +01:00
Jens Axboe	7eaceaccab	block: remove per-queue plugging Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So lets kill off the old plugging along with aops->sync_page(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-10 08:52:07 +01:00
NeilBrown	da9cf5050a	md: avoid spinlock problem in blk_throtl_exit blk_throtl_exit assumes that ->queue_lock still exists, so make sure that it does. To do this, we stop redirecting ->queue_lock to conf->device_lock and leave it pointing where it is initialised - __queue_lock. As the blk_plug functions check the ->queue_lock is held, we now take that spin_lock explicitly around the plug functions. We don't need the locking, just the warning removal. This is needed for any kernel with the blk_throtl code, which is which is 2.6.37 and later. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2011-02-21 18:25:57 +11:00
Krzysztof Wojcik	02214dc546	FIX: md: process hangs at wait_barrier after 0->10 takeover Following symptoms were observed: 1. After raid0->raid10 takeover operation we have array with 2 missing disks. When we add disk for rebuild, recovery process starts as expected but it does not finish- it stops at about 90%, md126_resync process hangs in "D" state. 2. Similar behavior is when we have mounted raid0 array and we execute takeover to raid10. After this when we try to unmount array- it causes process umount hangs in "D" In scenarios above processes hang at the same function- wait_barrier in raid10.c. Process waits in macro "wait_event_lock_irq" until the "!conf->barrier" condition will be true. In scenarios above it never happens. Reason was that at the end of level_store, after calling pers->run, we call mddev_resume. This calls pers->quiesce(mddev, 0) with RAID10, that calls lower_barrier. However raise_barrier hadn't been called on that 'conf' yet, so conf->barrier becomes negative, which is bad. This patch introduces setting conf->barrier=1 after takeover operation. It prevents to become barrier negative after call lower_barrier(). Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-02-08 11:49:02 +11:00
Jonathan Brassow	ccebd4c415	md-new-param-to_sync_page_io Add new parameter to 'sync_page_io'. The new parameter allows us to distinguish between metadata and data operations. This becomes important later when we add the ability to use separate devices for data and metadata. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>	2011-01-14 09:14:33 +11:00
Joe Perches	067032bc62	md: Fix single printks with multiple KERN_<level>s Noticed-by: Russell King <linux@arm.linux.org.uk> Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-01-14 09:14:33 +11:00
NeilBrown	589a594be1	md: protect against NULL reference when waiting to start a raid10. When we fail to start a raid10 for some reason, we call md_unregister_thread to kill the thread that was created. Unfortunately md_thread() will then make one call into the handler (raid10d) even though md_wakeup_thread has not been called. This is not safe and as md_unregister_thread is called after mddev->private has been set to NULL, it will definitely cause a NULL dereference. So fix this at both ends: - md_thread should only call the handler if THREAD_WAKEUP has been set. - raid10 should call md_unregister_thread before setting things to NULL just like all the other raid modules do. This is applicable to 2.6.35 and later. Cc: stable@kernel.org Reported-by: "Citizen" <citizen_lee@thecus.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-12-09 17:02:14 +11:00
NeilBrown	a167f66324	md: use separate bio pool for each md device. bio_clone and bio_alloc allocate from a common bio pool. If an md device is stacked with other devices that use this pool, or under something like swap which uses the pool, then the multiple calls on the pool can cause deadlocks. So allocate a local bio pool for each md array and use that rather than the common pool. This pool is used both for regular IO and metadata updates. Signed-off-by: NeilBrown <neilb@suse.de>	2010-10-28 17:36:15 +11:00
NeilBrown	2b193363ef	md: change type of first arg to sync_page_io. Currently sync_page_io takes a 'bdev'. Every caller passes 'rdev->bdev'. We will soon want another field out of the rdev in sync_page_io, So just pass the rdev instead of the bdev out of it. Signed-off-by: NeilBrown <neilb@suse.de>	2010-10-28 17:36:11 +11:00
NeilBrown	6746557f03	md: use bio_kmalloc rather than bio_alloc when failure is acceptable. bio_alloc can never fail (as it uses a mempool) but an block indefinitely, especially if the caller is holding a reference to a previously allocated bio. So these to places which both handle failure and hold multiple bios should not use bio_alloc, they should use bio_kmalloc. Signed-off-by: NeilBrown <neilb@suse.de>	2010-10-28 17:36:06 +11:00
NeilBrown	4e78064f42	md: Fix possible deadlock with multiple mempool allocations. It is not safe to allocate from a mempool while holding an item previously allocated from that mempool as that can deadlock when the mempool is close to exhaustion. So don't use a bio list to collect the bios to write to multiple devices in raid1 and raid10. Instead queue each bio as it becomes available so an unplug will activate all previously allocated bios and so a new bio has a chance of being allocated. This means we must set the 'remaining' count to '1' before submitting any requests, then when all are submitted, decrement 'remaining' and possible handle the write completion at that point. Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com> Tested-by: Torsten Kaiser <just.for.lkml@googlemail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-10-28 17:34:07 +11:00
NeilBrown	57dab0bdf6	md: use sector_t in bitmap_get_counter bitmap_get_counter returns the number of sectors covered by the counter in a pass-by-reference variable. In some cases this can be very large, so make it a sector_t for safety. Signed-off-by: NeilBrown <neilb@suse.de>	2010-10-28 17:32:26 +11:00
Tejun Heo	e9c7469bb4	md: implment REQ_FLUSH/FUA support This patch converts md to support REQ_FLUSH/FUA instead of now deprecated REQ_HARDBARRIER. In the core part (md.c), the following changes are notable. * Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with processing of other requests and thus there is no reason to mark the queue congested while FLUSH/FUA is in progress. * REQ_FLUSH/FUA failures are final and its users don't need retry logic. Retry logic is removed. * Preflush needs to be issued to all member devices but FUA writes can be handled the same way as other writes - their processing can be deferred to request_queue of member devices. md_barrier_request() is renamed to md_flush_request() and simplified accordingly. For linear, raid0 and multipath, the core changes are enough. raid1, 5 and 10 need the following conversions. * raid1: Handling of FLUSH/FUA bio's can simply be deferred to request_queues of member devices. Barrier related logic removed. * raid5: Queue draining logic dropped. FUA bit is propagated through biodrain and stripe resconstruction such that all the updated parts of the stripe are written out with FUA writes if any of the dirtying writes was FUA. preread_active_stripes handling in make_request() is updated as suggested by Neil Brown. * raid10: FUA bit needs to be propagated to write clones. linear, raid0, 1, 5 and 10 tested. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-10 12:35:38 +02:00
NeilBrown	2c7d46ec19	md raid-1/10 Fix bio_rw bit manipulations again commit `7b6d91daee` changed the behaviour of a few variables in raid1 and raid10 from flags to bit-sets, but left them as type 'bool' so they did not work. Change them (back) to unsigned long. (historical note: see `1ef04fefe2`) Signed-off-by: NeilBrown <neilb@suse.de> Reported-by: Jiri Slaby <jslaby@suse.cz> and many others	2010-08-18 16:16:05 +10:00
NeilBrown	6b96562054	md: provide appropriate return value for spare_active functions. md_check_recovery expects ->spare_active to return 'true' if any spares were activated, but none of them do, so the consequent change in 'degraded' is not notified through sysfs. So count the number of spares activated, subtract it from 'degraded' just once, and return it. Reported-by: Adrian Drzewiecki <adriand@vmware.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-08-18 12:04:32 +10:00
Adrian Drzewiecki	e6ffbcb6cd	md: Notify sysfs when RAID1/5/10 disk is In_sync. When RAID1 is done syncing disks, it'll update the state of synced rdevs to In_sync. But it neglected to notify sysfs that the attribute changed. So any programs that are waiting for an rdev's state to change will not be woken. (raid5/raid10 added by neilb) Signed-off-by: Adrian Drzewiecki <adriand@vmware.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-08-18 11:49:02 +10:00
Linus Torvalds	3d30701b58	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: (24 commits) md: clean up do_md_stop md: fix another deadlock with removing sysfs attributes. md: move revalidate_disk() back outside open_mutex md/raid10: fix deadlock with unaligned read during resync md/bitmap: separate out loading a bitmap from initialising the structures. md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log. md/bitmap: optimise scanning of empty bitmaps. md/bitmap: clean up plugging calls. md/bitmap: reduce dependence on sysfs. md/bitmap: white space clean up and similar. md/raid5: export raid5 unplugging interface. md/plug: optionally use plugger to unplug an array during resync/recovery. md/raid5: add simple plugging infrastructure. md/raid5: export is_congested test raid5: Don't set read-ahead when there is no queue md: add support for raising dm events. md: export various start/stop interfaces md: split out md_rdev_init md: be more careful setting MD_CHANGE_CLEAN md/raid5: ensure we create a unique name for kmem_cache when mddev has no gendisk ...	2010-08-10 15:38:19 -07:00
Christoph Hellwig	7b6d91daee	block: unify flags for struct bio and struct request Remove the current bio flags and reuse the request flags for the bio, too. This allows to more easily trace the type of I/O from the filesystem down to the block driver. There were two flags in the bio that were missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've renamed two request flags that had a superflous RW in them. Note that the flags are in bio.h despite having the REQ_ name - as blkdev.h includes bio.h that is the only way to go for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:20:39 +02:00
NeilBrown	51e9ac7703	md/raid10: fix deadlock with unaligned read during resync If the 'bio_split' path in raid10-read is used while resync/recovery is happening it is possible to deadlock. Fix this be elevating ->nr_waiting for the duration of both parts of the split request. This fixes a bug that has been present since 2.6.22 but has only started manifesting recently for unknown reasons. It is suitable for and -stable since then. Reported-by: Justin Bronder <jsbronder@gentoo.org> Tested-by: Justin Bronder <jsbronder@gentoo.org> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org	2010-08-07 21:17:00 +10:00

... 2 3 4 5 6 ...

487 Commits