OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Mikulas Patocka	f0b9a4502b	dm: move bio_io_error into __split_and_process_bio Move the bio_io_error() calls directly into __split_and_process_bio(). This avoids some code duplication in later patches. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:38 +01:00
Mikulas Patocka	8a53c28db4	dm: rename __split_bio Rename __split_bio() to __split_and_process_bio() because it not only splits the bio to serveral parts, but also submits them to target drivers. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:37 +01:00
Mikulas Patocka	53d5914f28	dm: remove unnecessary struct dm_wq_req Remove struct dm_wq_req and move "work" directly into struct mapped_device. In the revised implementation, the thread will do just one type of work (processing the queue). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:37 +01:00
Mikulas Patocka	9a1fb46448	dm: remove unnecessary work queue context field Remove the context field from struct dm_wq_req because we will no longer need it. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:36 +01:00
Mikulas Patocka	143773965b	dm: remove unnecessary work queue type field Remove "type" field from struct dm_wq_req because we no longer need it to have more than one value. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:36 +01:00
Mikulas Patocka	99c75e3130	dm: bio list add bio_list_add_head Introduce a function that adds a bio to the head of the list for use by the patch that will support barriers. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:36 +01:00
Jonathan Brassow	a32079ce17	dm snapshot: persistent fix dtr cleanup The persistent exception store destructor does not properly account for all conditions in which it can be called. If it is called after 'ctr' but before 'read_metadata' (e.g. if something else in 'snapshot_ctr' fails) then it will attempt to free areas of memory that haven't been allocated yet. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:35 +01:00
Jonathan Brassow	1e302a929e	dm snapshot: move status to exception store Let the exception store types print out their status through the new API, rather than having the snapshot code do it. Adjust the buffer position to allow for the preceding DMEMIT in the arguments to type->status(). Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:35 +01:00
Jonathan Brassow	fee1998e9c	dm snapshot: move ctr parsing to exception store First step of having the exception stores parse their own arguments - generalizing the interface. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:34 +01:00
Jonathan Brassow	2e4a31df2b	dm snapshot: use DMEMIT macro for status Use DMEMIT in place of snprintf. This makes it easier later when other modules are helping to populate our status output. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:34 +01:00
Jonathan Brassow	ccc45ea8ae	dm snapshot: remove dm_snap header Move some of the last bits from dm-snap.h into dm-snap.c where they belong and remove dm-snap.h. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:34 +01:00
Jonathan Brassow	71fab00a6b	dm snapshot: remove dm_snap header use Move useful functions out of dm-snap.h and stop using dm-snap.h. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:33 +01:00
Jonathan Brassow	49beb2b87a	dm exception store: move cow pointer Move COW device from snapshot to exception store. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:33 +01:00
Jonathan Brassow	d021684951	dm exception store: move chunk_fields Move chunk fields from snapshot to exception store. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:32 +01:00
Jonathan Brassow	0cea9c7827	dm exception store: move dm_target pointer Move target pointer from snapshot to exception store. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:32 +01:00
Jonathan Brassow	493df71c64	dm exception store: introduce registry Move exception stores into a registry. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:31 +01:00
Jonathan Brassow	7513c2a761	dm raid1: add is_remote_recovering hook for clusters The logging API needs an extra function to make cluster mirroring possible. This new function allows us to check whether a mirror region is being recovered on another machine in the cluster. This helps us prevent simultaneous recovery I/O and process I/O to the same locations on disk. Cluster-aware log modules will implement this function. Single machine log modules will not. So, there is no performance penalty for single machine mirrors. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Acked-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:30 +01:00
Jonathan Brassow	b2a1146529	dm exception store: separate type from instance Introduce struct dm_exception_store_type. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:30 +01:00
Mike Snitzer	ec44ab9d66	dm log: remove struct dm_dirty_log_internal Remove the 'dm_dirty_log_internal' structure. The resulting cleanup eliminates extra memory allocations. Therefore exposing the internal list_head to the external 'dm_dirty_log_type' structure is a worthwhile compromise. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:30 +01:00
Mike Snitzer	84e67c9319	dm log: use standard kernel module refcount Avoid private module usage accounting by removing 'use' from dm_dirty_log_internal. The standard module reference counting is sufficient. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:29 +01:00
Johannes Weiner	b81d6cf79b	dm crypt: use kzfree Use kzfree() instead of memset() + kfree(). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:28 +01:00
Cheng Renquan	45194e4f89	dm target: remove struct tt_internal The tt_internal is really just a list_head to manage registered target_type in a double linked list, Here embed the list_head into target_type directly, 1. to avoid kmalloc/kfree; 2. then tt_internal is really unneeded; Cc: stable@kernel.org Signed-off-by: Cheng Renquan <crquan@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Reviewed-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:28 +01:00
Alasdair G Kergon	570b9d968b	dm table: fix upgrade mode race upgrade_mode() sets bdev to NULL temporarily, and does not have any locking to exclude anything from seeing that NULL. In dm_table_any_congested() bdev_get_queue() can dereference that NULL and cause a reported oops. Fix this by not changing that field during the mode upgrade. Cc: stable@kernel.org Cc: Neil Brown <neilb@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:28 +01:00
Jun'ichi Nomura	aea9058801	dm: path selector use module refcount directly Fix refcount corruption in dm-path-selector Refcounting with non-atomic ops under shared lock will corrupt the counter in multi-processor system and may trigger BUG_ON(). Use module refcount. # same approach as dm-target-use-module-refcount-directly.patch here # https://www.redhat.com/archives/dm-devel/2008-December/msg00075.html Typical oops: kernel BUG at linux-2.6.29-rc3/drivers/md/dm-path-selector.c:90! Pid: 11148, comm: dmsetup Not tainted 2.6.29-rc3-nm #1 dm_put_path_selector+0x4d/0x61 [dm_multipath] Call Trace: [<ffffffffa031d3f9>] free_priority_group+0x33/0xb3 [dm_multipath] [<ffffffffa031d4aa>] free_multipath+0x31/0x67 [dm_multipath] [<ffffffffa031d50d>] multipath_dtr+0x2d/0x32 [dm_multipath] [<ffffffffa015d6c2>] dm_table_destroy+0x64/0xd8 [dm_mod] [<ffffffffa015b73a>] __unbind+0x46/0x4b [dm_mod] [<ffffffffa015b79f>] dm_swap_table+0x60/0x14d [dm_mod] [<ffffffffa015f963>] dev_suspend+0xfd/0x177 [dm_mod] [<ffffffffa0160250>] dm_ctl_ioctl+0x24c/0x29c [dm_mod] [<ffffffff80288cd3>] ? get_page_from_freelist+0x49c/0x61d [<ffffffffa015f866>] ? dev_suspend+0x0/0x177 [dm_mod] [<ffffffff802bf05c>] vfs_ioctl+0x2a/0x77 [<ffffffff802bf4f1>] do_vfs_ioctl+0x448/0x4a0 [<ffffffff802bf5a0>] sys_ioctl+0x57/0x7a [<ffffffff8020c05b>] system_call_fastpath+0x16/0x1b Cc: stable@kernel.org Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:27 +01:00
Cheng Renquan	5642b8a61a	dm target: use module refcount directly The tt_internal's 'use' field is superfluous: the module's refcount can do the work properly. An acceptable side-effect is that this increases the reference counts reported by 'lsmod'. Remove the superfluous test when removing a target module. [Crash possible without this on SMP - agk] Cc: stable@kernel.org Signed-off-by: Cheng Renquan <crquan@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Reviewed-by: Alasdair G Kergon <agk@redhat.com> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>	2009-04-02 19:55:27 +01:00
Mikulas Patocka	35bf659b00	dm snapshot: avoid having two exceptions for the same chunk We need to check if the exception was completed after dropping the lock. After regaining the lock, __find_pending_exception checks if the exception was already placed into &s->pending hash. But we don't check if the exception was already completed and placed into &s->complete hash. If the process waiting in alloc_pending_exception was delayed at this point because of a scheduling latency and the exception was meanwhile completed, we'd miss that and allocate another pending exception for already completed chunk. It would lead to a situation where two records for the same chunk exist and potential data corruption because multiple snapshot I/Os to the affected chunk could be redirected to different locations in the snapshot. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:26 +01:00
Mikulas Patocka	c66213921c	dm snapshot: avoid dropping lock in __find_pending_exception It is uncommon and bug-prone to drop a lock in a function that is called with the lock held, so this is moved to the caller. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:25 +01:00
Mikulas Patocka	2913808eb5	dm snapshot: refactor __find_pending_exception Move looking-up of a pending exception from __find_pending_exception to another function. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:25 +01:00
Mikulas Patocka	b64b6bf4fd	dm io: make sync_io uninterruptible If someone sends signal to a process performing synchronous dm-io call, the kernel may crash. The function sync_io attempts to exit with -EINTR if it has pending signal, however the structure "io" is allocated on stack, so already submitted io requests end up touching unallocated stack space and corrupting kernel memory. sync_io sets its state to TASK_UNINTERRUPTIBLE, so the signal can't break out of io_schedule() --- however, if the signal was pending before sync_io entered while (1) loop, the corruption of kernel memory will happen. There is no way to cancel in-progress IOs, so the best solution is to ignore signals at this point. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:24 +01:00
Mikulas Patocka	95f8fac8dc	dm raid1: switch read_record from kmalloc to slab to save memory With my previous patch to save bi_io_vec, the size of dm_raid1_read_record is significantly increased (the vector list takes 3072 bytes on 32-bit machines and 4096 bytes on 64-bit machines). The structure dm_raid1_read_record used to be allocated with kmalloc, but kmalloc aligns the size on the next power-of-two so an object slightly greater than 4096 will allocate 8192 bytes of memory and half of that memory will be wasted. This patch turns kmalloc into a slab cache which doesn't have this padding so it will reduce the memory consumed. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:24 +01:00
Mikulas Patocka	a920f6b3ac	dm: preserve bi_io_vec when resubmitting bios Device mapper saves and restores various fields in the bio, but it doesn't save bi_io_vec. If the device driver modifies this after a partially successful request, dm-raid1 and dm-multipath may attempt to resubmit a bio that has bi_size inconsistent with the size of vector. To make requests resubmittable in dm-raid1 and dm-multipath, we must save and restore the bio vector as well. To reduce the memory overhead involved in this, we do not save the pages in a vector and use a 16-bit field size if the page size is less than 65536. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-04-02 19:55:23 +01:00
NeilBrown	c8f517c444	md/raid5 revise rules for when to update metadata during reshape We currently update the metadata : 1/ every 3Megabytes 2/ When the place we will write new-layout data to is recorded in the metadata as still containing old-layout data. Rule one exists to avoid having to re-do too much reshaping in the face of a crash/restart. So it should really be time based rather than size based. So change it to "every 10 seconds". Rule two turns out to be too harsh when restriping an array 'in-place', as in that case the metadata much be updates for every stripe. For the in-place update, it can only possibly be safe from a crash if some user-space program data a backup of every e.g. few hundred stripes before allowing them to be reshaped. In that case, the constant metadata update is pointless. So only update the metadata if the new metadata will report that the end of the 'old-layout' data is beyond where we are currently writing 'new-layout' data. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 15:28:40 +11:00
NeilBrown	b0f9ec047b	md/raid5: minor code cleanups in make_request. ... and to be certain the that make_request doesn't wait forever, add a 'wake_up' when ->reshape_progress has been set to MaxSector Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 15:27:18 +11:00
NeilBrown	2cffc4a01d	md: remove CONFIG_MD_RAID_RESHAPE config option. This was only needed when the code was experimental. Most of it is well tested now, so the option is no longer useful. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 15:27:05 +11:00
NeilBrown	ab69ae12ce	md/raid5: be more careful about write ordering when reshaping. When we are reshaping an array, it is very important that we read the data from a particular sector offset before writing new data at that offset. In most cases when growing or shrinking an array we read long before we even consider writing. But when restriping an array without changing it size, there is a small possibility that we might have some data to available write before the read has happened at the same location. This would require some stripes to be in cache already. To guard against this small possibility, we check, before writing, that the 'old' stripe at the same location is not in the process of being read. And we ensure that we mark all 'source' stripes as such before allowing new 'destination' stripes to proceed. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 15:26:47 +11:00
NeilBrown	d1a7c50369	md: don't display meaningless values in sysfs files resync_start and sync_speed When no resync if happening, both of these files currently have meaningless values (is slightly different ways). Change them to "none" in that case. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 15:24:32 +11:00
NeilBrown	88ce4930e2	md/raid5: allow layout and chunksize to be changed on active array. If an array has 3 or more devices, we allow the chunksize or layout to be changed and when a reshape starts, we use these as the 'new' values. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 15:24:23 +11:00
NeilBrown	7a66138107	md/raid5: reshape using largest of old and new chunk size This ensures that even when old and new stripes are overlapping, we will try to read all of the old before having to write any of the new. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 15:21:40 +11:00
NeilBrown	e183eaedd5	md/raid5: prepare for allowing reshape to change layout Add prev_algo to raid5_conf_t along the same lines as prev_chunk and previous_raid_disks. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 15:20:22 +11:00
NeilBrown	784052ecc6	md/raid5: prepare for allowing reshape to change chunksize. Add "prev_chunk" to raid5_conf_t, similar to "previous_raid_disks", to remember what the chunk size was before the reshape that is currently underway. This seems like duplication with "chunk_size" and "new_chunk" in mddev_t, and to some extent it is, but there are differences. The values in mddev_t are always defined and often the same. The prev* values are only defined if a reshape is underway. Also (and more significantly) the raid5_conf_t values will be changed at the same time (inside an appropriate lock) that the reshape is started by setting reshape_position. In contrast, the new_chunk value is set when the sysfs file is written which could be well before the reshape starts. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 15:19:07 +11:00
NeilBrown	86b42c713b	md/raid5: clearly differentiate 'before' and 'after' stripes during reshape. During a raid5 reshape, we have some stripes in the cache that are 'before' the reshape (and are still to be processed) and some that are 'after'. They are currently differentiated by having different ->disks values as the only reshape current supported involves changing the number of disks. However we will soon support reshapes that do not change the number of disks (chunk parity or chunk size). So make the difference more explicit with a 'generation' number. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 15:19:03 +11:00
NeilBrown	ec32a2bd35	md: allow number of drives in raid5 to be reduced When reshaping a raid5 to have fewer devices, we work from the end of the array to the beginning. md_do_sync gives addresses to sync_request that go from the beginning to the end. So largely ignore them use the internal state variable "reshape_progress" to keep track of what to do next. Never allow the size to be reduced below the minimum (4 for raid6, 3 otherwise). We require that the size of the array has already been reduced before the array is reshaped to a smaller size. This is because simply reducing the size is an easily reversible operation, while the reshape is immediately destructive and so is not reversible for the blocks at the ends of the devices. Thus to reshape an array to have fewer devices, you must first write an appropriately small size to md/array_size. When reshape finished, we remove any drives that are no longer needed and fix up ->degraded. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 15:17:38 +11:00
NeilBrown	fef9c61fdf	md/raid5: change reshape-progress measurement to cope with reshaping backwards. When reducing the number of devices in a raid4/5/6, the reshape process has to start at the end of the array and work down to the beginning. So we need to handle expand_progress and expand_lo differently. This patch renames "expand_progress" and "expand_lo" to avoid the implication that anything is getting bigger (expand->reshape) and every place they are used, we make sure that they are used the right way depending on whether delta_disks is positive or negative. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 15:16:46 +11:00
NeilBrown	cea9c22800	md: add explicit method to signal the end of a reshape. Currently raid5 (the only module that supports restriping) notices that the reshape has finished be sync_request being given a large value, and handles any cleanup them. This patch changes it so md_check_recovery calls into an explicit finish_reshape method as well. The clean-up from sync_request can do things that need to be done promptly, typically things local to the raid5_conf_t structure. The "finish_reshape" method is called under the mddev_lock so it can do things involving reconfiguring the device. This allows us to get rid of md_set_array_sectors_locked, which would have caused a deadlock if you tried to stop and array while a reshape was happening. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 15:15:05 +11:00
NeilBrown	7ec0547838	md/raid5: enhance raid5_size to work correctly with negative delta_disks This is the first of four patches which combine to allow md/raid5 to reduce the number of devices in the array by restriping the data over a subset of the devices. If the number of disks in a raid4/5/6 is being reduced, then the default size must be based on the new number, not the old number of devices. In general, it should be based on the smaller of new and old. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 15:10:36 +11:00
NeilBrown	34e04e87fb	md/raid5: drop qd_idx from r6_state We now have this value in stripe_head so we don't need to duplicate it. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 15:10:16 +11:00
Dan Williams	f701d589aa	md/raid6: move raid6 data processing to raid6_pq.ko Move the raid6 data processing routines into a standalone module (raid6_pq) to prepare them to be called from async_tx wrappers and other non-md drivers/modules. This precludes a circular dependency of raid456 needing the async modules for data processing while those modules in turn depend on raid456 for the base level synchronous raid6 routines. To support this move: 1/ The exportable definitions in raid6.h move to include/linux/raid/pq.h 2/ The raid6_call, recovery calls, and table symbols are exported 3/ Extra #ifdef __KERNEL__ statements to enable the userspace raid6test to compile Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 15:09:39 +11:00
Andre Noll	18b0033491	md: raid5 run(): Fix max_degraded for raid level 4. raid4 allows only one failed disk. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 15:00:56 +11:00
Dan Williams	b522adcde9	md: 'array_size' sysfs attribute Allow userspace to set the size of the array according to the following semantics: 1/ size must be <= to the size returned by mddev->pers->size(mddev, 0, 0) a) If size is set before the array is running, do_md_run will fail if size is greater than the default size b) A reshape attempt that reduces the default size to less than the set array size should be blocked 2/ once userspace sets the size the kernel will not change it 3/ writing 'default' to this attribute returns control of the size to the kernel and reverts to the size reported by the personality Also, convert locations that need to know the default size from directly reading ->array_sectors to <pers>_size. Resync/reshape operations always follow the default size. Finally, fixup other locations that read a number of 1k-blocks from userspace to use strict_blocks_to_sectors() which checks for unsigned long long to sector_t overflow and blocks to sectors overflow. Reviewed-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-03-31 15:00:31 +11:00
Dan Williams	1f403624bd	md: centralize ->array_sectors modifications Get personalities out of the business of directly modifying ->array_sectors. Lays groundwork to introduce policy on when ->array_sectors can be modified. Reviewed-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-03-31 14:59:03 +11:00
Dan Williams	80c3a6ce4b	md: add 'size' as a personality method In preparation for giving userspace control over ->array_sectors we need to be able to retrieve the 'default' size, and the 'anticipated' size when a reshape is requested. For personalities that do not reshape emit a warning if anything but the default size is requested. In the raid5 case we need to update ->previous_raid_disks to make the new 'default' size available. Reviewed-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-03-31 14:57:49 +11:00
Atsushi SAKAI	93ed05e2a5	md: fix typo in FSF address Hello, I found a typo Bosto"m" in FSF address. And I am checking around linux source code. Here is the only place which uses Bosto"m" (not Boston). Signed-off-by: Atsushi SAKAI <sakaia@jp.fujitsu.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:57:37 +11:00
NeilBrown	fc9739c6d6	md: add takeover support for converting raid6 back into raid5 If a raid6 is still in the layout that comes from converting raid5 into a raid6. this will allow us to convert it back again. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:57:20 +11:00
NeilBrown	e9d4758f6e	md: add takeover support for raid4 -> raid5 conversion. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:57:09 +11:00
NeilBrown	b354603527	md/raid5: allow layout/chunksize to be changed on an active 2-drive raid5. 2-drive raid5's aren't very interesting. But if you are converting a raid1 into a raid5, you will at least temporarily have one. And that it a good time to set the layout/chunksize for the new RAID5 if you aren't happy with the defaults. layout and chunksize don't actually affect the placement of data on a 2-drive raid5, so we just do some internal book-keeping. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:56:41 +11:00
NeilBrown	d562b0c431	md: add ->takeover method for raid5 to be able to take over raid1 The RAID1 must have two drives and be a suitable size to be a multiple of a chunksize that isn't too small. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:39:39 +11:00
NeilBrown	245f46c2c2	md: add ->takeover method to support changing the personality managing an array Implement this for RAID6 to be able to 'takeover' a RAID5 array. The new RAID6 will use a layout which places Q on the last device, and that device will be missing. If there are any available spares, one will immediately have Q recovered onto it. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:39:39 +11:00
NeilBrown	409c57f380	md: enable suspend/resume of md devices. To be able to change the 'level' of an md/raid array, we need to suspend the device so that no requests are active - then move some pointers around etc. The code already keeps counts of active requests and the ->quiesce function can be used to wait until those counts hit zero. However the quiesce function blocks new requests once they are all ready 'inside' the personality module, and that is too late if we want to replace the personality modules. So make all md requests come in through a common md_make_request function that keeps track of how many requests have entered the modules but may not yet be on the internal reference counts. Allow md_make_request to be blocked when we want to suspend the device, and make it possible to wait for all those in-transit requests to be added to internal lists so that ->quiesce can wait for them. There is still a problem that when a request completes, we drop the ref count inside the personality code so there is a short time between when the refcount hits zero, and when the personality code is no longer being used. The personality code never blocks (schedule or spinlock) between dropping the refcount and exiting the routine, so this should be safe (as put_module calls synchronize_sched() before unmapping the module code). Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:39:39 +11:00
NeilBrown	e0cf8f045b	md: md_unregister_thread should cope with being passed NULL Mostly md_unregister_thread is only called when we know that the thread is NULL, but sometimes we need to check first. It is safer to put the check inside md_unregister_thread itself. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:39:39 +11:00
NeilBrown	91adb56473	md/raid5: refactor raid5 "run" .. so that the code to create the private data structures is separate. This will help with future code to change the level of an active array. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:39:39 +11:00
NeilBrown	34817e8c39	md: make sure new_level, new_chunksize, new_layout always have sensible values. When an md array is undergoing a change, we have new_* fields that show the new values. When no change is happening, it is least confusing if these have the same value as the normal fields. This is true in most cases, but not when the values are set via sysfs. So fix this up. A subsequent patch will BUG_ON if these things aren't consistent. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:39:38 +11:00
NeilBrown	67cc2b8165	md/raid5: finish support for DDF/raid6 DDF requires RAID6 calculations over different devices in a different order. For md/raid6, we calculate over just the data devices, starting immediately after the 'Q' block. For ddf/raid6 we calculate over all devices, using zeros in place of the P and Q blocks. This requires unfortunately complex loops... Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:39:38 +11:00
NeilBrown	99c0fb5f92	md/raid5: Add support for new layouts for raid5 and raid6. DDF uses different layouts for P and Q blocks than current md/raid6 so add those that are missing. Also add support for RAID6 layouts that are identical to various raid5 layouts with the simple addition of one device to hold all of the 'Q' blocks. Finally add 'raid5' layouts to match raid4. These last to will allow online level conversion. Note that this does not provide correct support for DDF/raid6 yet as the order in which data blocks are summed to produce the Q block is significant and different between current md code and DDF requirements. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:39:38 +11:00
NeilBrown	911d4ee853	md/raid5: simplify raid5_compute_sector interface Rather than passing 'pd_idx' and 'qd_idx' to be filled in, pass a 'struct stripe_head *' and fill in the relevant fields. This is more extensible. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:39:38 +11:00
NeilBrown	d0dabf7e57	md/raid6: remove expectation that Q device is immediately after P device. Code currently assumes that the devices in a raid6 stripe are 0 1 ... N-1 P Q in some rotated order. We will shortly add new layouts in which this strict pattern is broken. So remove this expectation. We still assume that the data disks are roughly in-order. However P and Q can be inserted anywhere within that order. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:39:38 +11:00
NeilBrown	112bf8970d	md/raid5: change raid5_compute_sector and stripe_to_pdidx to take a 'previous' argument This similar to the recent change to get_active_stripe. There is no functional change, just come rearrangement to make future patches cleaner. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:39:38 +11:00
NeilBrown	b5663ba405	md/raid5: simplify interface for init_stripe and get_active_stripe Rather than passing 'pd_idx' and 'disks' to these functions, just pass 'previous' which tells whether to use the 'previous' or 'current' geometry during a reshape, and let init_stripe calculate disks and pd_idx and anything else it might need. This is not a substantial simplification and even adds a division. However we will shortly be adding more complexity to init_stripe to handle more interesting 'reshape' activities, and without this change, the interface to these functions would get very complex. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:39:38 +11:00
Andre Noll	dd8ac336c1	md: Represent raid device size in sectors. This patch renames the "size" field of struct mdk_rdev_s to "sectors" and changes this field to store sectors instead of blocks. All users of this field, linear.c, raid0.c and md.c, are fixed up accordingly which gets rid of many multiplications and divisions. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:33:13 +11:00
Andre Noll	58c0fed400	md: Make mddev->size sector-based. This patch renames the "size" field of struct mddev_s to "dev_sectors" and stores the number of 512-byte sectors instead of the number of 1K-blocks in it. All users of that field, including raid levels 1,4-6,10, are adjusted accordingly. This simplifies the code a bit because it allows to get rid of a couple of divisions/multiplications by two. In order to make checkpatch happy, some minor coding style issues have also been addressed. In particular, size_store() now uses strict_strtoull() instead of simple_strtoull(). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:33:13 +11:00
NeilBrown	575a80fa4f	md: be more consistent about setting WriteMostly flag when adding a drive to an array When a drive is added to an array using ADD_NEW_DISK, there are two places we can get certain flags from: the metadata on the disk or the flags passed through the IOCTL. For the WriteMostly flag (aka MD_DISK_WRITEMOSTLY) we take the value from either of those sources depending on if it is set (i.e. we effectively 'or' the two sources together). This makes it awkward to clear, and is at best inconsistent. As documented code (in mdadm) requires that setting MD_DISK_WRITEMOSTLY in the ioctl will be effective, we resolve the inconsistency by always using the value for this flag from the ioctl, and ignoring the value on disk. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:33:13 +11:00
NeilBrown	97e4f42d62	md: occasionally checkpoint drive recovery to reduce duplicate effort after a crash Version 1.x metadata has the ability to record the status of a partially completed drive recovery. However we only update that record on a clean shutdown. It would be nice to update it on unclean shutdowns too, particularly when using a bitmap that removes much to the 'sync' effort after an unclean shutdown. One complication with checkpointing recovery is that we only know where we are up to in terms of IO requests started, not which ones have completed. And we need to know what has completed to record how much is recovered. So occasionally pause the recovery until all submitted requests are completed, then update the record of where we are up to. When we have a bitmap, we already do that pause occasionally to keep the bitmap up-to-date. So enhance that code to record the recovery offset and schedule a superblock update. And when there is no bitmap, just pause 16 times during the resync to do a checkpoint. '16' is a fairly arbitrary number. But we don't really have any good way to judge how often is acceptable, and it seems like a reasonable number for now. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:33:13 +11:00
NeilBrown	43b2e5d86d	md: move md_k.h from include/linux/raid/ to drivers/md/ It really is nicer to keep related code together.. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:33:13 +11:00
NeilBrown	bff61975b3	md: move lots of #include lines out of .h files and into .c This makes the includes more explicit, and is preparation for moving md_k.h to drivers/md/md.h Remove include/raid/md.h as its only remaining use was to #include other files. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:33:13 +11:00
NeilBrown	8b2b5c217c	md: move LEVEL_* definition from md_k.h to md_u.h .. as they are part of the user-space interface. Also move MdpMinorShift into there so we can remove duplication. Lastly move mdp_major in. It is less obviously part of the user-space interface, but do_mounts_md.c uses it, and it is acting a bit like user-space. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:27:03 +11:00
Christoph Hellwig	ef740c372d	md: move headers out of include/linux/raid/ Move the headers with the local structures for the disciplines and bitmap.h into drivers/md/ so that they are more easily grepable for hacking and not far away. md.h is left where it is for now as there are some uses from the outside. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:27:03 +11:00
Christoph Hellwig	2a40a8aed0	cleanup drivers/md/Makefile Use the -y variables instead of the old -objs so we can easily add conditional objects to the modules. Also always use += to add subobjects to avoid problems when placing additional objects in some place in the file. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:27:02 +11:00
Christoph Hellwig	3dbd8c2e3f	md: stop defining MAJOR_NR MAJOR_NR was only required for magic in linux/blk.h in 2.4 or earlier kernels, so no need to keep it around. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:27:02 +11:00
Martin K. Petersen	3f9d99c12a	MD data integrity support md: Add support for data integrity to MD If all subdevices support the same protection format the MD device is flagged as integrity capable. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:27:02 +11:00
NeilBrown	355a43e641	md: write bitmap information to devices that are undergoing recovery. When we add some spares to an array and start recovery, and we have a bitmap which is stored 'internally' on all devices, we call bitmap_write_all to make sure the bitmap is correct on the new device(s). However that doesn't work as write_sb_page only writes to 'In_sync' devices, and devices undergoing recovery are not 'In_sync' until recovery finishes. So extend write_sb_page (actually next_active_rdev) to include devices that are under recovery. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:27:02 +11:00
NeilBrown	d0a4bb4927	md: never clear bit from the write-intent bitmap when the array is degraded. It is safe to clear a bit from the write-intent bitmap for a raid1 if we know the data has been written to all devices, which is what the current test does. But it is not always safe to update the 'events_cleared' counter in that case. This is because one request could complete successfully after some other request has partially failed. So simply disable the clearing and updating of events_cleared whenever the array is degraded. This might end up not clearing some bits that could safely be cleared, but it is safest approach. Note that the bug fixed here did not risk corrupting data by letting the array get out-of-sync. Rather it meant that when a device is removed and re-added to the array, it might incorrectly require a full recovery rather than just recovering based on the bitmap. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:27:02 +11:00
NeilBrown	1187cf0a3c	md: Allow write-intent bitmaps to have chunksize < PAGE_SIZE md currently insists that the chunk size used for write-intent bitmaps (the amount of data that corresponds to one chunk) be at least one page. The reason for this restriction is lost in the mists of time, but a review of the code (and a vague memory) suggests that the only problem would be related to resync. Resync tries very hard to work in multiples of a page, but also needs to sync with units of a bitmap_chunk too. This connection comes out in the bitmap_start_sync call. So change bitmap_start_sync to always work in multiples of a page. If the bitmap chunk size is less that one page, we flag multiple chunks as 'syncing' and generally make them all appear to the resync routines like one chunk. All other code either already works with data ranges that could span multiple chunks, or explicitly only cares about a single chunk. Signed-off-by: Neil Brown <neilb@suse.de>	2009-03-31 14:27:02 +11:00
NeilBrown	eea1bf384e	md: Fix is_mddev_idle test (again). There are two problems with is_mddev_idle. 1/ sync_io is 'atomic_t' and hence 'int'. curr_events and all the rest are 'long'. So if sync_io were to wrap on a 64bit host, the value of curr_events would go very negative suddenly, and take a very long time to return to positive. So do all calculations as 'int'. That gives us plenty of precision for what we need. 2/ To initialise rdev->last_events we simply call is_mddev_idle, on the assumption that it will make sure that last_events is in a suitable range. It used to do this, but now it does not. So now we need to be more explicit about initialisation. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:27:02 +11:00
Milan Broz	b35f8caa08	dm crypt: wait for endio to complete before destruction The following oops has been reported when dm-crypt runs over a loop device. ... [ 70.381058] Process loop0 (pid: 4268, ti=cf3b2000 task=cf1cc1f0 task.ti=cf3b2000) ... [ 70.381058] Call Trace: [ 70.381058] [<d0d76601>] ? crypt_dec_pending+0x5e/0x62 [dm_crypt] [ 70.381058] [<d0d767b8>] ? crypt_endio+0xa2/0xaa [dm_crypt] [ 70.381058] [<d0d76716>] ? crypt_endio+0x0/0xaa [dm_crypt] [ 70.381058] [<c01a2f24>] ? bio_endio+0x2b/0x2e [ 70.381058] [<d0806530>] ? dec_pending+0x224/0x23b [dm_mod] [ 70.381058] [<d08066e4>] ? clone_endio+0x79/0xa4 [dm_mod] [ 70.381058] [<d080666b>] ? clone_endio+0x0/0xa4 [dm_mod] [ 70.381058] [<c01a2f24>] ? bio_endio+0x2b/0x2e [ 70.381058] [<c02bad86>] ? loop_thread+0x380/0x3b7 [ 70.381058] [<c02ba8a1>] ? do_lo_send_aops+0x0/0x165 [ 70.381058] [<c013754f>] ? autoremove_wake_function+0x0/0x33 [ 70.381058] [<c02baa06>] ? loop_thread+0x0/0x3b7 When a table is being replaced, it waits for I/O to complete before destroying the mempool, but the endio function doesn't call mempool_free() until after completing the bio. Fix it by swapping the order of those two operations. The same problem occurs in dm.c with md referenced after dec_pending. Again, we swap the order. Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-03-16 17:44:36 +00:00
Huang Ying	b2174eebd1	dm crypt: fix kcryptd_async_done parameter In the async encryption-complete function (kcryptd_async_done), the crypto_async_request passed in may be different from the one passed to crypto_ablkcipher_encrypt/decrypt. Only crypto_async_request->data is guaranteed to be same as the one passed in. The current kcryptd_async_done uses the passed-in crypto_async_request directly which may cause the AES-NI-based AES algorithm implementation to panic. This patch fixes this bug by only using crypto_async_request->data, which points to dm_crypt_request, the crypto_async_request passed in. The original data (convert_context) is gotten from dm_crypt_request. [mbroz@redhat.com: reworked] Cc: stable@kernel.org Signed-off-by: Huang Ying <ying.huang@intel.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-03-16 17:44:33 +00:00
Mikulas Patocka	d659e6cc98	dm io: respect BIO_MAX_PAGES limit dm-io calls bio_get_nr_vecs to get the maximum number of pages to use for a given device. It allocates one additional bio_vec to use internally but failed to respect BIO_MAX_PAGES, so fix this. This was the likely cause of: https://bugzilla.redhat.com/show_bug.cgi?id=173153 Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-03-16 17:44:30 +00:00
Mikulas Patocka	f80a557008	dm table: rework reference counting fix Fix an error introduced in dm-table-rework-reference-counting.patch. When there is failure after table initialization, we need to use dm_table_destroy, not dm_table_put, to free the table. dm_table_put may be used only after dm_table_get. Cc: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com> Reviewed-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-03-16 17:44:26 +00:00
Milan Broz	bc0fd67feb	dm ioctl: validate name length when renaming When renaming a mapped device validate the length of the new name. The rename ioctl accepted any correctly-terminated string enclosed within the data passed from userspace. The other ioctls enforce a size limit of DM_NAME_LEN. If the name is changed and becomes longer than that, the device can no longer be addressed by name. Fix it by properly checking for device name length (including terminating zero). Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com> Reviewed-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-03-16 16:56:01 +00:00
Dan Williams	5fd3a17ed4	md: fix deadlock when stopping arrays Resolve a deadlock when stopping redundant arrays, i.e. ones that require a call to sysfs_remove_group when shutdown. The deadlock is summarized below: Thread1 Thread2 ------- ------- read sysfs attribute stop array take mddev lock sysfs_remove_group sysfs_get_active wait for mddev lock wait for active Sysrq-w: -------- mdmon S 00000017 2212 4163 1 f1982ea8 00000046 2dcf6b85 00000017 c0b23100 f2f83ed0 c0b23100 f2f8413c c0b23100 c0b23100 c0b1fb98 f2f8413c 00000000 f2f8413c c0b23100 f2291ecc 00000002 c0b23100 00000000 00000017 f2f83ed0 f1982eac 00000046 c044d9dd Call Trace: [<c044d9dd>] ? debug_mutex_add_waiter+0x1d/0x58 [<c06ef451>] __mutex_lock_common+0x1d9/0x338 [<c06ef451>] ? __mutex_lock_common+0x1d9/0x338 [<c06ef5e3>] mutex_lock_interruptible_nested+0x33/0x3a [<c0634553>] ? mddev_lock+0x14/0x16 [<c0634553>] mddev_lock+0x14/0x16 [<c0634eda>] md_attr_show+0x2a/0x49 [<c04e9997>] sysfs_read_file+0x93/0xf9 mdadm D 00000017 2812 4177 1 f0401d78 00000046 430456f8 00000017 f0401d58 f0401d20 c0b23100 f2da2c4c c0b23100 c0b23100 c0b1fb98 f2da2c4c 0a10fc36 00000000 c0b23100 f0401d70 00000003 c0b23100 00000000 00000017 f2da29e0 00000001 00000002 00000000 Call Trace: [<c06eed1b>] schedule_timeout+0x1b/0x95 [<c06eed1b>] ? schedule_timeout+0x1b/0x95 [<c06eeb97>] ? wait_for_common+0x34/0xdc [<c044fa8a>] ? trace_hardirqs_on_caller+0x18/0x145 [<c044fbc2>] ? trace_hardirqs_on+0xb/0xd [<c06eec03>] wait_for_common+0xa0/0xdc [<c0428c7c>] ? default_wake_function+0x0/0x12 [<c06eeccc>] wait_for_completion+0x17/0x19 [<c04ea620>] sysfs_addrm_finish+0x19f/0x1d1 [<c04e920e>] sysfs_hash_and_remove+0x42/0x55 [<c04eb4db>] sysfs_remove_group+0x57/0x86 [<c0638086>] do_md_stop+0x13a/0x499 This has been there for a while, but is easier to trigger now that mdmon is closely watching sysfs. Cc: <stable@kernel.org> Reported-by: Jacek Danecki <jacek.danecki@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-03-04 00:57:25 -07:00
NeilBrown	73d5c38a95	md: avoid races when stopping resync. There has been a race in raid10 and raid1 for a long time which has only recently started showing up due to a scheduler changed. When a sync_read request finishes, as soon as reschedule_retry is called, another thread can mark the resync request as having completed, so md_do_sync can finish, ->stop can be called, and ->conf can be freed. So using conf after reschedule_retry is not safe. Similarly, when finishing a sync_write, calling md_done_sync must be the last thing we do, as it allows a chain of events which will free conf and other data structures. The first of these requires action in raid10.c The second requires action in raid1.c and raid10.c Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2009-02-25 13:18:47 +11:00
NeilBrown	78200d45cd	md/raid10: Don't call bitmap_cond_end_sync when we are doing recovery. For raid1/4/5/6, resync (fixing inconsistencies between devices) is very similar to recovery (rebuilding a failed device onto a spare). The both walk through the device addresses in order. For raid10 it can be quite different. resync follows the 'array' address, and makes sure all copies are the same. Recover walks through 'device' addresses and recreates each missing block. The 'bitmap_cond_end_sync' function allows the write-intent-bitmap (When present) to be updated to reflect a partially completed resync. It makes assumptions which mean that it does not work correctly for raid10 recovery at all. In particularly, it can cause bitmap-directed recovery of a raid10 to not recovery some of the blocks that need to be recovered. So move the call to bitmap_cond_end_sync into the resync path, rather than being in the common "resync or recovery" path. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2009-02-25 13:18:47 +11:00
NeilBrown	09b4068a7f	md/raid10: Don't skip more than 1 bitmap-chunk at a time during recovery. When doing recovery on a raid10 with a write-intent bitmap, we only need to recovery chunks that are flagged in the bitmap. However if we choose to skip a chunk as it isn't flag, the code currently skips the whole raid10-chunk, thus it might not recovery some blocks that need recovering. This patch fixes it. In case that is confusing, it might help to understand that there is a 'raid10 chunk size' which guides how data is distributed across the devices, and a 'bitmap chunk size' which says how much data corresponds to a single bit in the bitmap. This bug only affects cases where the bitmap chunk size is smaller than the raid10 chunk size. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2009-02-25 13:18:47 +11:00
Jens Axboe	93dbb39350	block: fix bad definition of BIO_RW_SYNC We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before `213d9417fe`. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-02-18 10:32:00 +01:00
NeilBrown	de01dfadf2	md: Ensure an md array never has too many devices. Each different metadata format supported by md supports a different maximum number of devices. We really should be enforcing this maximum in the kernel, but we aren't quite doing that properly. We currently only enforce it at the 'hot_add' point, which is an older interface which is not used by current userspace. We need to also enforce it at 'add_new_disk' time for active arrays and at 'do_md_run' time when starting a new array. So move the test from 'hot_add' into 'bind_rdev_to_array' which is called from both 'hot_add' and 'add_new_disk, and add a new test in 'analyse_sbs' which is called from 'do_md_run'. This bug (or missing feature) has been around "forever" and so the patch is suitable for any -stable that is currently maintained. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2009-02-06 18:02:46 +11:00
Andre Noll	852c8bf484	md: Fix a bug in linear.c causing which_dev() to return the wrong device. `ab5bd5cbc8` introduced the following bug in linear software raid for large arrays on 32 bit machines: which_dev() computes the device holding a given sector by shifting down the sector number to a 32 bit range, dividing by the array spacing and looking up the resulting index in the hash table of the array. Because the computed index might be slightly too small, a loop at the end of which_dev() increases the index until the given sector actually falls into the range of the device associated with that index. The changes of the above mentioned commit caused this loop to check whether the _index_ rather than the sector number is small enough, effectively bypassing the loop and thus possibly returning the wrong device. As reported by Simon Kirby, this leads to errors such as linear_make_request: Sector 2340486136 out of bounds on dev sdi: 156301312 sectors, offset 2109870464 Fix this bug by introducing a local variable for the index so that the variable containing the passed sector is left unchanged. Cc: stable@kernel.org Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-02-06 15:10:52 +11:00
NeilBrown	4706b349f4	md: Allow read error in a single drive raid1 to be passed up. If a raid1 only has a single working device and gets a read error, we choose to simply return that error up to the filesystem (or whatever) rather than failing the whole array. However the codes doesn't quite do that. We attempt a readbalance which allocates the same drive, so we retry the read - indefinitely. Instead: If read_balance in the error case chooses the same drive that just failed, treat it as a failure and don't retry. Signed-off-by: NeilBrown <neilb@suse.de>	2009-02-06 15:06:47 +11:00
NeilBrown	4044ba58dd	md: don't retry recovery of raid1 that fails due to error on source drive. If a raid1 has only one working drive and it has a sector which gives an error on read, then an attempt to recover onto a spare will fail, but as the single remaining drive is not removed from the array, the recovery will be immediately re-attempted, resulting in an infinite recovery loop. So detect this situation and don't retry recovery once an error on the lone remaining drive is detected. Allow recovery to be retried once every time a spare is added in case the problem wasn't actually a media error. Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:11 +11:00
NeilBrown	efeb53c0e5	md: Allow md devices to be created by name. Using sequential numbers to identify md devices is somewhat artificial. Using names can be a lot more user-friendly. Also, creating md devices by opening the device special file is a bit awkward. So this patch provides a new option for creating and naming devices. Writing a name such as "md_home" to /sys/modules/md_mod/parameters/new_array will cause an array with that name to be created. It will appear in /sys/block/ /proc/partitions and /proc/mdstat as 'md_home'. It will have an arbitrary minor number allocated. md devices that a created by an open are destroyed on the last close when the device is inactive. For named md devices, they will not be destroyed until the array is explicitly stopped, either with the STOP_ARRAY ioctl or by writing 'clear' to /sys/block/md_XXXX/md/array_state. The name of the array must start 'md_' to avoid conflict with other devices. Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:10 +11:00
NeilBrown	d3374825ce	md: make devices disappear when they are no longer needed. Currently md devices, once created, never disappear until the module is unloaded. This is essentially because the gendisk holds a reference to the mddev, and the mddev holds a reference to the gendisk, this a circular reference. If we drop the reference from mddev to gendisk, then we need to ensure that the mddev is destroyed when the gendisk is destroyed. However it is not possible to hook into the gendisk destruction process to enable this. So we drop the reference from the gendisk to the mddev and destroy the gendisk when the mddev gets destroyed. However this has a complication. Between the call __blkdev_get->get_gendisk->kobj_lookup->md_probe and the call __blkdev_get->md_open there is no obvious way to hold a reference on the mddev any more, so unless something is done, it will disappear and gendisk will be destroyed prematurely. Also, once we decide to destroy the mddev, there will be an unlockable moment before the gendisk is unlinked (blk_unregister_region) during which a new reference to the gendisk can be created. We need to ensure that this reference can not be used. i.e. the ->open must fail. So: 1/ in md_probe we set a flag in the mddev (hold_active) which indicates that the array should be treated as active, even though there are no references, and no appearance of activity. This is cleared by md_release when the device is closed if it is no longer needed. This ensures that the gendisk will survive between md_probe and md_open. 2/ In md_open we check if the mddev we expect to open matches the gendisk that we did open. If there is a mismatch we return -ERESTARTSYS and modify __blkdev_get to retry from the top in that case. In the -ERESTARTSYS sys case we make sure to wait until the old gendisk (that we succeeded in opening) is really gone so we loop at most once. Some udev configurations will always open an md device when it first appears. If we allow an md device that was just created by an open to disappear on an immediate close, then this can race with such udev configurations and result in an infinite loop the device being opened and closed, then re-open due to the 'ADD' even from the first open, and then close and so on. So we make sure an md device, once created by an open, remains active at least until some md 'ioctl' has been made on it. This means that all normal usage of md devices will allow them to disappear promptly when not needed, but the worst that an incorrect usage will do it cause an inactive md device to be left in existence (it can easily be removed). As an array can be stopped by writing to a sysfs attribute echo clear > /sys/block/mdXXX/md/array_state we need to use scheduled work for deleting the gendisk and other kobjects. This allows us to wait for any pending gendisk deletion to complete by simply calling flush_scheduled_work(). Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:10 +11:00
NeilBrown	a21d15042d	md: centralise all freeing of an 'mddev' in 'md_free' md_free is the .release handler for the md kobj_type. So it makes sense to release all the objects referenced by the mddev in there, rather than just prior to calling kobject_put for what we think is the last time. Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:09 +11:00
NeilBrown	8b76539823	md: move allocation of ->queue from mddev_find to md_probe It is more balanced to just do simple initialisation in mddev_find, which allocates and links a new md device, and leave all the more sophisticated allocation to md_probe (which calls mddev_find). md_probe already allocated the gendisk. It should allocate the queue too. Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:08 +11:00
Cheng Renquan	cd2ac9321c	md: need another print_sb for mdp_superblock_1 md_print_devices is called in two code path: MD_BUG(...), and md_ioctl with PRINT_RAID_DEBUG. it will dump out all in use md devices information; However, it wrongly processed two types of superblock in one: The header file <linux/raid/md_p.h> has defined two types of superblock, struct mdp_superblock_s (typedefed with mdp_super_t) according to md with metadata 0.90, and struct mdp_superblock_1 according to md with metadata 1.0 and later, These two types of superblock are very different, The md_print_devices code processed them both in mdp_super_t, that would lead to wrong informaton dump like: [ 6742.345877] [ 6742.345887] md: ********************************** [ 6742.345890] md: * <COMPLETE RAID STATE PRINTOUT> * [ 6742.345892] md: ******************************** [ 6742.345896] md1: <ram7><ram6><ram5><ram4> [ 6742.345907] md: rdev ram7, SZ:00065472 F:0 S:1 DN:3 [ 6742.345909] md: rdev superblock: [ 6742.345914] md: SB: (V:0.90.0) ID:<42ef13c7.598c059a.5f9f1645.801e9ee6> CT:4919856d [ 6742.345918] md: L5 S00065472 ND:4 RD:4 md1 LO:2 CS:65536 [ 6742.345922] md: UT:4919856d ST:1 AD:4 WD:4 FD:0 SD:0 CSUM:b7992907 E:00000001 [ 6742.345924] D 0: DISK<N:0,(1,8),R:0,S:6> [ 6742.345930] D 1: DISK<N:1,(1,10),R:1,S:6> [ 6742.345933] D 2: DISK<N:2,(1,12),R:2,S:6> [ 6742.345937] D 3: DISK<N:3,(1,14),R:3,S:6> [ 6742.345942] md: THIS: DISK<N:3,(1,14),R:3,S:6> ... [ 6742.346058] md0: <ram3><ram2><ram1><ram0> [ 6742.346067] md: rdev ram3, SZ:00065472 F:0 S:1 DN:3 [ 6742.346070] md: rdev superblock: [ 6742.346073] md: SB: (V:1.0.0) ID:<369aad81.00000000.00000000.00000000> CT:9a322a9c [ 6742.346077] md: L-1507699579 S976570180 ND:48 RD:0 md0 LO:65536 CS:196610 [ 6742.346081] md: UT:00000018 ST:0 AD:131048 WD:0 FD:8 SD:0 CSUM:00000000 E:00000000 [ 6742.346084] D 0: DISK<N:-1,(-1,-1),R:-1,S:-1> [ 6742.346089] D 1: DISK<N:-1,(-1,-1),R:-1,S:-1> [ 6742.346092] D 2: DISK<N:-1,(-1,-1),R:-1,S:-1> [ 6742.346096] D 3: DISK<N:-1,(-1,-1),R:-1,S:-1> [ 6742.346102] md: THIS: DISK<N:0,(0,0),R:0,S:0> ... [ 6742.346219] md: ****************************** [ 6742.346221] Here md1 is metadata 0.90.0, and md0 is metadata 1.2 After some more code to distinguish these two types of superblock, in this patch, it will generate dump information like: [ 7906.755790] [ 7906.755799] md: ******************************** [ 7906.755802] md: * <COMPLETE RAID STATE PRINTOUT> * [ 7906.755804] md: ******************************** [ 7906.755808] md1: <ram7><ram6><ram5><ram4> [ 7906.755819] md: rdev ram7, SZ:00065472 F:0 S:1 DN:3 [ 7906.755821] md: rdev superblock (MJ:0): [ 7906.755826] md: SB: (V:0.90.0) ID:<3fca7a0d.a612bfed.5f9f1645.801e9ee6> CT:491989f3 [ 7906.755830] md: L5 S00065472 ND:4 RD:4 md1 LO:2 CS:65536 [ 7906.755834] md: UT:491989f3 ST:1 AD:4 WD:4 FD:0 SD:0 CSUM:00fb52ad E:00000001 [ 7906.755836] D 0: DISK<N:0,(1,8),R:0,S:6> [ 7906.755842] D 1: DISK<N:1,(1,10),R:1,S:6> [ 7906.755845] D 2: DISK<N:2,(1,12),R:2,S:6> [ 7906.755849] D 3: DISK<N:3,(1,14),R:3,S:6> [ 7906.755855] md: THIS: DISK<N:3,(1,14),R:3,S:6> ... [ 7906.755972] md0: <ram3><ram2><ram1><ram0> [ 7906.755981] md: rdev ram3, SZ:00065472 F:0 S:1 DN:3 [ 7906.755984] md: rdev superblock (MJ:1): [ 7906.755989] md: SB: (V:1) (F:0) Array-ID:<5fbcf158:55aa:5fbe:9a79:1e939880dcbd> [ 7906.755990] md: Name: "DG5:0" CT:1226410480 [ 7906.755998] md: L5 SZ130944 RD:4 LO:2 CS:128 DO:24 DS:131048 SO:8 RO:0 [ 7906.755999] md: Dev:00000003 UUID: 9194d744:87f7:a448:85f2:7497b84ce30a [ 7906.756001] md: (F:0) UT:1226410480 Events:0 ResyncOffset:-1 CSUM:0dbcd829 [ 7906.756003] md: (MaxDev:384) ... [ 7906.756113] md: ******************************** [ 7906.756116] this md0 (metadata 1.2) information dumping is exactly according to struct mdp_superblock_1. Signed-off-by: Cheng Renquan <crquan@gmail.com> Cc: Neil Brown <neilb@suse.de> Cc: Dan Williams <dan.j.williams@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:08 +11:00
Cheng Renquan	159ec1fc06	md: use list_for_each_entry macro directly The rdev_for_each macro defined in <linux/raid/md_k.h> is identical to list_for_each_entry_safe, from <linux/list.h>, it should be defined to use list_for_each_entry_safe, instead of reinventing the wheel. But some calls to each_entry_safe don't really need a safe version, just a direct list_for_each_entry is enough, this could save a temp variable (tmp) in every function that used rdev_for_each. In this patch, most rdev_for_each loops are replaced by list_for_each_entry, totally save many tmp vars; and only in the other situations that will call list_del to delete an entry, the safe version is used. Signed-off-by: Cheng Renquan <crquan@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:08 +11:00
Andre Noll	ccacc7d2cf	md: raid0: make hash_spacing and preshift sector-based. This patch renames the hash_spacing and preshift members of struct raid0_private_data to spacing and sector_shift respectively and changes the semantics as follows: We always have spacing = 2 * hash_spacing. In case sizeof(sector_t) > sizeof(u32) we also have sector_shift = preshift + 1 while sector_shift = preshift = 0 otherwise. Note that the values of nb_zone and zone are unaffected by these changes because in the sector_div() preceeding the assignement of these two variables both arguments double. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:08 +11:00
Andre Noll	83838ed878	md: raid0: Represent the size of strip zones in sectors. This completes the block -> sector conversion of struct strip_zone. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:07 +11:00
Andre Noll	0825b87a7d	md: raid0 create_strip_zones(): Add KERN_INFO/KERN_ERR to printk's. This patch consists only of these trivial changes. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:07 +11:00
Andre Noll	6b8796cc3d	md: raid0 create_strip_zones(): Make two local variables sector-based. current_offset and curr_zone_offset stored the corresponding offsets as 1K quantities. Rename them to current_start and curr_zone_start to match the naming of struct strip_zone and store the offsets as sector counts. Also, add KERN_INFO to the printk() affected by this change to make checkpatch happy. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:07 +11:00
Andre Noll	6199d3db0f	md: raid0: Represent zone->zone_offset in sectors. For the same reason as in the previous patch, rename it from zone_offset to zone_start. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:07 +11:00
Andre Noll	019c4e2f3e	md: raid0: Represent device offset in sectors. Rename zone->dev_offset to zone->dev_start to make sure all users have been converted. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:06 +11:00
Andre Noll	e0f0686834	md: raid0_make_request(): Replace local variable block by sector. This change already simplifies the code a bit. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:06 +11:00
Andre Noll	a471200595	md: raid0_make_request(): Remove local variable chunk_size. We might as well use chunk_sects instead. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:06 +11:00
Andre Noll	1b7fdf8ff7	md: raid0_make_request(): Replace chunksize_bits by chunksect_bits. As ffz(~(2 * x)) = ffz(~x) + 1, we have chunksect_bits = chunksize_bits + 1. Fixup all users accordingly. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:06 +11:00
NeilBrown	0c3573f19d	md: use sysfs_notify_dirent to notify changes to md/sync_action. There is no compelling need for this, but sysfs_notify_dirent is a nicer interface and the change is good for consistency. Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:05 +11:00
NeilBrown	538452700d	md: fix bitmap-on-external-file bug. commit `a2ed9615e3` fixed a bug with 'internal' bitmaps, but in the process broke 'in a file' bitmaps. So they are broken in 2.6.28 This fixes it, and needs to go in 2.6.28-stable. Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org	2009-01-09 08:31:05 +11:00
Jonathan Brassow	a159c1ac5f	dm snapshot: extend exception store functions Supply dm_add_exception as a callback to the read_metadata function. Add a status function ready for a later patch and name the functions consistently. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:19 +00:00
Alasdair G Kergon	4db6bfe02b	dm snapshot: split out exception store implementations Move the existing snapshot exception store implementations out into separate files. Later patches will place these behind a new interface in preparation for alternative implementations. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:17 +00:00
Jonathan Brassow	1ae25f9c93	dm snapshot: rename struct exception_store Rename struct exception_store to dm_exception_store. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:16 +00:00
Jonathan Brassow	aea53d92f7	dm snapshot: separate out exception store interface Pull structures that bridge the gap between snapshot and exception store out of dm-snap.h and put them in a new .h file - dm-exception-store.h. This file will define the API for new exception stores. Ultimately, dm-snap.h is unnecessary, since only dm-snap.c should be using it. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:15 +00:00
Alasdair G Kergon	fe9cf30eb8	dm mpath: move trigger_event to system workqueue The same workqueue is used both for sending uevents and processing queued I/O. Deadlock has been reported in RHEL5 when sending a uevent was blocked waiting for the queued I/O to be processed. Use scheduled_work() for the asynchronous uevents instead. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:13 +00:00
Milan Broz	784aae735d	dm: add name and uuid to sysfs Implement simple read-only sysfs entry for device-mapper block device. This patch adds a simple sysfs directory named "dm" under block device properties and implements - name attribute (string containing mapped device name) - uuid attribute (string containing UUID, or empty string if not set) The kobject is embedded in mapped_device struct, so no additional memory allocation is needed for initializing sysfs entry. During the processing of sysfs attribute we need to lock mapped device which is done by a new function dm_get_from_kobj, which returns the md associated with kobject and increases the usage count. Each 'show attribute' function is responsible for its own locking. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:12 +00:00
Mikulas Patocka	d58168763f	dm table: rework reference counting Rework table reference counting. The existing code uses a reference counter. When the last reference is dropped and the counter reaches zero, the table destructor is called. Table reference counters are acquired/released from upcalls from other kernel code (dm_any_congested, dm_merge_bvec, dm_unplug_all). If the reference counter reaches zero in one of the upcalls, the table destructor is called from almost random kernel code. This leads to various problems: * dm_any_congested being called under a spinlock, which calls the destructor, which calls some sleeping function. * the destructor attempting to take a lock that is already taken by the same process. * stale reference from some other kernel code keeps the table constructed, which keeps some devices open, even after successful return from "dmsetup remove". This can confuse lvm and prevent closing of underlying devices or reusing device minor numbers. The patch changes reference counting so that the table destructor can be called only at predetermined places. The table has always exactly one reference from either mapped_device->map or hash_cell->new_map. After this patch, this reference is not counted in table->holders. A pair of dm_create_table/dm_destroy_table functions is used for table creation/destruction. Temporary references from the other code increase table->holders. A pair of dm_table_get/dm_table_put functions is used to manipulate it. When the table is about to be destroyed, we wait for table->holders to reach 0. Then, we call the table destructor. We use active waiting with msleep(1), because the situation happens rarely (to one user in 5 years) and removing the device isn't performance-critical task: the user doesn't care if it takes one tick more or not. This way, the destructor is called only at specific points (dm_table_destroy function) and the above problems associated with lazy destruction can't happen. Finally remove the temporary protection added to dm_any_congested(). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:10 +00:00
Andi Kleen	ab4c142488	dm: support barriers on simple devices Implement barrier support for single device DM devices This patch implements barrier support in DM for the common case of dm linear just remapping a single underlying device. In this case we can safely pass the barrier through because there can be no reordering between devices. NB. Any DM device might cease to support barriers if it gets reconfigured so code must continue to allow for a possible -EOPNOTSUPP on every barrier bio submitted. - agk Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:09 +00:00
Kiyoshi Ueda	8fbf26ad5b	dm request: add caches This patch prepares some kmem_caches for request-based dm. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:06 +00:00
Milan Broz	23d39f63aa	dm ioctl: allow dm_copy_name_and_uuid to return only one field Allow NULL buffer in dm_copy_name_and_uuid if you only want to return one of the fields. (Required by a following patch that adds these fields to sysfs.) Signed-off-by: Milan Broz <mbroz@redhat.com> Reviewed-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:04 +00:00
Milan Broz	ac1f0ac22c	dm log: ensure log bitmap fits on log device Check that the log bitmap will fit within the log device. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:02 +00:00
Milan Broz	2045e88edb	dm log: move region_size validation Move log size validation from mirror target to log constructor. Removed PAGE_SIZE restriction we no longer think necessary. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:01 +00:00
Takahiro Yasui	6f3af01cb0	dm log: avoid reinitialising io_req on every operation rw_header function updates three members of io_req data every time when I/O is processed. bi_rw and notify.fn are never modified once they get initialized, and so they can be set in advance. header_to_disk() can also be pulled out of write_header() since only one caller needs it and write_header() can be replaced by rw_header() directly. Signed-off-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:04:59 +00:00
Mikulas Patocka	10d3bd09a3	dm: consolidate target deregistration error handling Change dm_unregister_target to return void and use BUG() for error reporting. dm_unregister_target can only fail because of programming bug in the target driver. It can't fail because of user's behavior or disk errors. This patch changes unregister_target to return void and use BUG if someone tries to unregister non-registered target or unregister target that is in use. This patch removes code duplication (testing of error codes in all dm targets) and reports bugs in just one place, in dm_unregister_target. In some target drivers, these return codes were ignored, which could lead to a situation where bugs could be missed. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:04:58 +00:00
Jonathan Brassow	d460c65a6a	dm raid1: fix error count Always increase the error count when I/O on a leg of a mirror fails. The error count is used to decide whether to select an alternative mirror leg. If the target doesn't use the "handle_errors" feature, the error count is not updated and the bio can get requeued forever by the read callback. Fix it by increasing error_count before the handle_errors feature checking. Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:04:57 +00:00
Takahiro Yasui	c7a2bd19b7	dm log: fix dm_io_client leak on error paths In create_log_context function, dm_io_client_destroy function needs to be called, when memory allocation of disk_header, sync_bits and recovering_bits failed, but dm_io_client_destroy is not called. Cc: stable@kernel.org Signed-off-by: Takahiro Yasui <tyasui@redhat.com> Acked-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:04:56 +00:00
Mikulas Patocka	90fa1527bd	dm snapshot: change yield to msleep Change yield() to msleep(1). If the thread had realtime priority, yield() doesn't really yield, so the yielding process would loop indefinitely and cause machine lockup. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:04:54 +00:00
Mikulas Patocka	a1b51e9867	dm table: drop reference at unbind Move one dm_table_put() so that the last reference in the thread gets dropped in __unbind(). This is required for a following patch, dm-table-rework-reference-counting.patch, which will change the logic in such a way that table destructor is called only at specific points in the code. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:04:53 +00:00
Jens Axboe	bb799ca020	bio: allow individual slabs in the bio_set Instead of having a global bio slab cache, add a reference to one in each bio_set that is created. This allows for personalized slabs in each bio_set, so that they can have bios of different sizes. This means we can personalize the bios we return. File systems may want to embed the bio inside another structure, to avoid allocation more items (and stuffing them in ->bi_private) after the get a bio. Or we may want to embed a number of bio_vecs directly at the end of a bio, to avoid doing two allocations to return a bio. This is now possible. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-12-29 08:29:23 +01:00
Ingo Molnar	db8862eafe	Merge branch 'linus' into tracing/hw-branch-tracing	2008-12-24 21:08:26 +01:00
NeilBrown	a2ed9615e3	md: Don't read past end of bitmap when reading bitmap. When we read the write-intent-bitmap off the device, we currently read a whole number of pages. When PAGE_SIZE is 4K, this works due to the alignment we enforce on the superblock and bitmap. When PAGE_SIZE is 64K, this case read past the end-of-device which causes an error. When we write the superblock, we ensure to clip the last page to just be the required size. Copy that code into the read path to just read the required number of sectors. Signed-off-by: Neil Brown <neilb@suse.de> Cc: stable@kernel.org	2008-12-19 16:25:01 +11:00
Ingo Molnar	970987beb9	Merge branches 'tracing/ftrace', 'tracing/function-graph-tracer' and 'tracing/urgent' into tracing/core	2008-12-05 14:45:22 +01:00
Milan Broz	0e435ac26e	block: fix setting of max_segment_size and seg_boundary mask Fix setting of max_segment_size and seg_boundary mask for stacked md/dm devices. When stacking devices (LVM over MD over SCSI) some of the request queue parameters are not set up correctly in some cases by default, namely max_segment_size and and seg_boundary mask. If you create MD device over SCSI, these attributes are zeroed. Problem become when there is over this mapping next device-mapper mapping - queue attributes are set in DM this way: request_queue max_segment_size seg_boundary_mask SCSI 65536 0xffffffff MD RAID1 0 0 LVM 65536 -1 (64bit) Unfortunately bio_add_page (resp. bio_phys_segments) calculates number of physical segments according to these parameters. During the generic_make_request() is segment cout recalculated and can increase bio->bi_phys_segments count over the allowed limit. (After bio_clone() in stack operation.) Thi is specially problem in CCISS driver, where it produce OOPS here BUG_ON(creq->nr_phys_segments > MAXSGENTRIES); (MAXSEGENTRIES is 31 by default.) Sometimes even this command is enough to cause oops: dd iflag=direct if=/dev/<vg>/<lv> of=/dev/null bs=128000 count=10 This command generates bios with 250 sectors, allocated in 32 4k-pages (last page uses only 1024 bytes). For LVM layer, it allocates bio with 31 segments (still OK for CCISS), unfortunatelly on lower layer it is recalculated to 32 segments and this violates CCISS restriction and triggers BUG_ON(). The patch tries to fix it by: * initializing attributes above in queue request constructor blk_queue_make_request() * make sure that blk_queue_stack_limits() inherits setting (DM uses its own function to set the limits because it blk_queue_stack_limits() was introduced later. It should probably switch to use generic stack limit function too.) * sets the default seg_boundary value in one place (blkdev.h) * use this mask as default in DM (instead of -1, which differs in 64bit) Bugs related to this: https://bugzilla.redhat.com/show_bug.cgi?id=471639 http://bugzilla.kernel.org/show_bug.cgi?id=8672 Signed-off-by: Milan Broz <mbroz@redhat.com> Reviewed-by: Alasdair G Kergon <agk@redhat.com> Cc: Neil Brown <neilb@suse.de> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Tejun Heo <htejun@gmail.com> Cc: Mike Miller <mike.miller@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-12-03 12:55:55 +01:00
Ingo Molnar	0bfc24559d	blktrace: port to tracepoints, update Port to the new tracepoints API: split DEFINE_TRACE() and DECLARE_TRACE() sites. Spread them out to the usage sites, as suggested by Mathieu Desnoyers. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>	2008-11-26 13:04:35 +01:00
Arnaldo Carvalho de Melo	5f3ea37c77	blktrace: port to tracepoints This was a forward port of work done by Mathieu Desnoyers, I changed it to encode the 'what' parameter on the tracepoint name, so that one can register interest in specific events and not on classes of events to then check the 'what' parameter. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-26 12:13:34 +01:00
Chandra Seetharaman	8a57dfc6f9	dm: avoid destroying table in dm_any_congested dm_any_congested() just checks for the DMF_BLOCK_IO and has no code to make sure that suspend waits for dm_any_congested() to complete. This patch adds such a check. Without it, a race can occur with dm_table_put() attempting to destroying the table in the wrong thread, the one running dm_any_congested() which is meant to be quick and return immediately. Two examples of problems: 1. Sleeping functions called from congested code, the caller of which holds a spin lock. 2. An ABBA deadlock between pdflush and multipathd. The two locks in contention are inode lock and kernel lock. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-11-13 23:39:14 +00:00
Mikulas Patocka	d221d2e776	dm: move pending queue wake_up end_io_acct This doesn't fix any bug, just moves wake_up immediately after decrementing md->pending, for better code readability. It must be clear to anyone manipulating md->pending to wake up the queue if md->pending reaches zero, so move the wakeup as close to the decrementing as possible. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-11-13 23:39:10 +00:00
Chandra Seetharaman	14e98c5ca8	dm mpath: warn if args ignored Currently dm ignores the parameters provided to hardware handlers without providing any notifications to the user. This patch just prints a warning message so that the user knows that the arguments are ignored. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-11-13 23:39:06 +00:00
Chandra Seetharaman	b81aa1c792	dm mpath: avoid attempting to activate null path Path activation code is called even when the pgpath is NULL. This could lead to a panic in activate_path(). Such a panic is seen in -rt kernel. This problem has been there before the pg_init() was moved to a workqueue. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-11-13 23:39:00 +00:00
Heinz Mauelshagen	6edebdee48	dm stripe: fix init failure Don't proceed if dm_stripe_init() fails to register itself as a dm target. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-11-13 23:38:56 +00:00
Mikulas Patocka	18776c7316	dm raid1: flush workqueue before destruction We queue work on keventd queue --- so this queue must be flushed in the destructor. Otherwise, keventd could access mirror_set after it was freed. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org	2008-11-13 23:38:52 +00:00
Andre Noll	f1cd14ae52	md: linear: Fix a division by zero bug for very small arrays. We currently oops with a divide error on starting a linear software raid array consisting of at least two very small (< 500K) devices. The bug is caused by the calculation of the hash table size which tries to compute sector_div(sz, base) with "base" being zero due to the small size of the component devices of the array. Fix this by requiring the hash spacing to be at least one which implies that also "base" is non-zero. This bug has existed since about 2.6.14. Cc: stable@kernel.org Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2008-11-06 19:41:24 +11:00
NeilBrown	a53a6c8575	md: fix bug in raid10 recovery. Adding a spare to a raid10 doesn't cause recovery to start. This is due to an silly type in commit `6c2fce2ef6` and so is a bug in 2.6.27 and .28-rc. Thanks to Thomas Backlund for bisecting to find this. Cc: Thomas Backlund <tmb@mandriva.org> Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2008-11-06 17:28:20 +11:00
NeilBrown	cb3ac42b8a	md: revert the recent addition of a call to the BLKRRPART ioctl. It turns out that it is only safe to call blkdev_ioctl when the device is actually open (as ->bd_disk is set to NULL on last close). And it is quite possible for do_md_stop to be called when the device is not open. So discard the call to blkdev_ioctl(BLKRRPART) which was added in commit `934d9c23b4` It is just as easy to call this ioctl from userspace when needed (on mdadm -S) so leave it out of the kernel Signed-off-by: NeilBrown <neilb@suse.de>	2008-11-06 17:28:01 +11:00
Linus Torvalds	721d5dfe7e	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: md: destroy partitions and notify udev when md array is stopped.	2008-10-30 18:36:16 -07:00
Mikulas Patocka	879129d208	dm snapshot: wait for chunks in destructor If there are several snapshots sharing an origin and one is removed while the origin is being written to, the snapshot's mempool may get deleted while elements are still referenced. Prior to dm-snapshot-use-per-device-mempools.patch the pending exceptions may still have been referenced after the snapshot was destroyed, but this was not a problem because the shared mempool was still there. This patch fixes the problem by tracking the number of mempool elements in use. The scenario: - You have an origin and two snapshots 1 and 2. - Someone writes to the origin. - It creates two exceptions in the snapshots, snapshot 1 will be primary exception, snapshot 2's pending_exception->primary_pe will point to the exception in snapshot 1. - The exceptions are being relocated, relocation of exception 1 finishes (but it's pending_exception is still allocated, because it is referenced by an exception from snapshot 2) - The user lvremoves snapshot 1 --- it calls just suspend (does nothing) and destructor. md->pending is zero (there is no I/O submitted to the snapshot by md layer), so it won't help us. - The destructor waits for kcopyd jobs to finish on snapshot 1 --- but there are none. - The destructor on snapshot 1 cleans up everything. - The relocation of exception on snapshot 2 finishes, it drops reference on primary_pe. This frees its primary_pe pointer. Primary_pe points to pending exception created for snapshot 1. So it frees memory into non-existing mempool. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-30 13:33:16 +00:00
Mikulas Patocka	60c856c8e2	dm snapshot: fix register_snapshot deadlock register_snapshot() performs a GFP_KERNEL allocation while holding _origins_lock for write, but that could write out dirty pages onto a device that attempts to acquire _origins_lock for read, resulting in deadlock. So move the allocation up before taking the lock. This path is not performance-critical, so it doesn't matter that we allocate memory and free it if we find that we won't need it. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-30 13:33:12 +00:00
Ilpo Jarvinen	b34578a484	dm raid1: fix do_failures Missing braces. Commit `1f965b1943` (dm raid1: separate region_hash interface part1) broke it. Signed-off-by: Ilpo Jarvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Heinz Mauelshagen <hjm@redhat.com>	2008-10-30 13:33:07 +00:00
NeilBrown	934d9c23b4	md: destroy partitions and notify udev when md array is stopped. md arrays are not currently destroyed when they are stopped - they remain in /sys/block. Last time I tried this I tripped over locking too much. A consequence of this is that udev doesn't remove anything from /dev. This is rather ugly. As an interim measure until proper device removal can be achieved, make sure all partitions are removed using the BLKRRPART ioctl, and send a KOBJ_CHANGE when an md array is stopped. Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-28 17:01:23 +11:00
Linus Torvalds	f8d56f1771	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: md: allow extended partitions on md devices. md: use sysfs_notify_dirent to notify changes to md/dev-xxx/state md: use sysfs_notify_dirent to notify changes to md/array_state	2008-10-26 16:42:18 -07:00
Linus Torvalds	2248485640	Merge git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev * git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev: (66 commits) [PATCH] kill the rest of struct file propagation in block ioctls [PATCH] get rid of struct file use in blkdev_ioctl() BLKBSZSET [PATCH] get rid of blkdev_locked_ioctl() [PATCH] get rid of blkdev_driver_ioctl() [PATCH] sanitize blkdev_get() and friends [PATCH] remember mode of reiserfs journal [PATCH] propagate mode through swsusp_close() [PATCH] propagate mode through open_bdev_excl/close_bdev_excl [PATCH] pass fmode_t to blkdev_put() [PATCH] kill the unused bsize on the send side of /dev/loop [PATCH] trim file propagation in block/compat_ioctl.c [PATCH] end of methods switch: remove the old ones [PATCH] switch sr [PATCH] switch sd [PATCH] switch ide-scsi [PATCH] switch tape_block [PATCH] switch dcssblk [PATCH] switch dasd [PATCH] switch mtd_blkdevs [PATCH] switch mmc ...	2008-10-23 10:23:07 -07:00
Linus Torvalds	5ed487bc2c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (46 commits) [PATCH] fs: add a sanity check in d_free [PATCH] i_version: remount support [patch] vfs: make security_inode_setattr() calling consistent [patch 1/3] FS_MBCACHE: don't needlessly make it built-in [PATCH] move executable checking into ->permission() [PATCH] fs/dcache.c: update comment of d_validate() [RFC PATCH] touch_mnt_namespace when the mount flags change [PATCH] reiserfs: add missing llseek method [PATCH] fix ->llseek for more directories [PATCH vfs-2.6 6/6] vfs: add LOOKUP_RENAME_TARGET intent [PATCH vfs-2.6 5/6] vfs: remove LOOKUP_PARENT from non LOOKUP_PARENT lookup [PATCH vfs-2.6 4/6] vfs: remove unnecessary fsnotify_d_instantiate() [PATCH vfs-2.6 3/6] vfs: add __d_instantiate() helper [PATCH vfs-2.6 2/6] vfs: add d_ancestor() [PATCH vfs-2.6 1/6] vfs: replace parent == dentry->d_parent by IS_ROOT() [PATCH] get rid of on-stack dentry in udf [PATCH 2/2] anondev: switch to IDA [PATCH 1/2] anondev: init IDR statically [JFFS2] Use d_splice_alias() not d_add() in jffs2_lookup() [PATCH] Optimise NFS readdir hack slightly. ...	2008-10-23 10:22:40 -07:00
Christoph Hellwig	72e8264eda	[PATCH] dm: kill lookup_device wrapper Now that lookup_bdev is exported and used by dm just use it directly instead of through a trivial wrapper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-23 05:12:57 -04:00
Kiyoshi Ueda	51157b4ab4	dm: tidy local_init This patch tidies local_init() in preparation for request-based dm. No functional change. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-21 17:45:08 +01:00
Kiyoshi Ueda	f431d9666f	dm: remove unused flush_all This patch removes the DM_WQ_FLUSH_ALL state that is unnecessary. The dm_queue_flush(md, DM_WQ_FLUSH_ALL, NULL) in dm_suspend() is never invoked because: - 'goto flush_and_out' is the same as 'goto out' because the 'goto flush_and_out' is called only when '!noflush' - If r is non-zero, then the code above will invoke 'goto out' and skip this code. No functional change. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-21 17:45:07 +01:00
Heinz Mauelshagen	1f965b1943	dm raid1: separate region_hash interface part1 Separate the region hash code from raid1 so it can be shared by forthcoming targets. Use BUG_ON() for failed async dm_io() calls. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-21 17:45:06 +01:00
Martin K. Petersen	f3e1d26ede	dm: mark split bio as cloned When a bio gets split, mark its fragments with the BIO_CLONED flag. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-21 17:45:04 +01:00
Milan Broz	0a4a1047a4	dm crypt: remove waitqueue Remove waitqueue no longer needed with the async crypto interface. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-21 17:45:03 +01:00
Milan Broz	393b47ef23	dm crypt: fix async split When writing io, dm-crypt has to allocate a new cloned bio and encrypt the data into newly-allocated pages attached to this bio. In rare cases, because of hw restrictions (e.g. physical segment limit) or memory pressure, sometimes more than one cloned bio has to be used, each processing a different fragment of the original. Currently there is one waitqueue which waits for one fragment to finish and continues processing the next fragment. But when using asynchronous crypto this doesn't work, because several fragments may be processed asynchronously or in parallel and there is only one crypt context that cannot be shared between the bio fragments. The result may be corruption of the data contained in the encrypted bio. The patch fixes this by allocating new dm_crypt_io structs (with new crypto contexts) and running them independently. The fragments contains a pointer to the base dm_crypt_io struct to handle reference counting, so the base one is properly deallocated after all the fragments are finished. In a low memory situation, this only uses one additional object from the mempool. If the mempool is empty, the next allocation simple waits for previous fragments to complete. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-21 17:45:02 +01:00
Milan Broz	b635b00e0e	dm crypt: tidy sector Prepare local sector variable (offset) for later patch. Do not update io->sector for still-running I/O. No functional change. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-21 17:45:00 +01:00
Mikulas Patocka	586e80e6ee	dm: remove dm header from targets Change #include "dm.h" to #include <linux/device-mapper.h> in all targets. Targets should not need direct access to internal DM structures. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-21 17:44:59 +01:00
Mikulas Patocka	d63a5ce3c0	dm: publish array_too_big Move array_too_big to include/linux/device-mapper.h because it is used by targets. Remove the test from dm-raid1 as the number of mirror legs is limited such that it can never fail. (Even for stripes it seems rather unlikely.) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-21 17:44:57 +01:00
Mikulas Patocka	7acedc5b98	dm exception store: fix misordered writes We must zero the next chunk on disk before writing out the current chunk, not after. Otherwise if the machine crashes at the wrong time, the "end of metadata" marker may be missing. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org	2008-10-21 17:44:56 +01:00
Alasdair G Kergon	7c9e6c1732	dm exception store: refactor zero_area Use a separate buffer for writing zeroes to the on-disk snapshot exception store, make the updating of ps->current_area explicit and refactor the code in preparation for the fix in the next patch. No functional change. Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@kernel.org	2008-10-21 17:44:55 +01:00
Mikulas Patocka	f68d4f3d39	dm snapshot: drop unused last_percent The last_percent field is unused - remove it. (It dates from when events were triggered as each X% filled up.) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-21 17:44:53 +01:00
Mikulas Patocka	7c5f78b9d7	dm snapshot: fix primary_pe race Fix a race condition with primary_pe ref_count handling. put_pending_exception runs under dm_snapshot->lock, it does atomic_dec_and_test on primary_pe->ref_count, and later does atomic_read primary_pe->ref_count. __origin_write does atomic_dec_and_test on primary_pe->ref_count without holding dm_snapshot->lock. This opens the following race condition: Assume two CPUs, CPU1 is executing put_pending_exception (and holding dm_snapshot->lock). CPU2 is executing __origin_write in parallel. primary_pe->ref_count == 2. CPU1: if (primary_pe && atomic_dec_and_test(&primary_pe->ref_count)) origin_bios = bio_list_get(&primary_pe->origin_bios); ... decrements primary_pe->ref_count to 1. Doesn't load origin_bios CPU2: if (first && atomic_dec_and_test(&primary_pe->ref_count)) { flush_bios(bio_list_get(&primary_pe->origin_bios)); free_pending_exception(primary_pe); /* If we got here, pe_queue is necessarily empty. */ return r; } ... decrements primary_pe->ref_count to 0, submits pending bios, frees primary_pe. CPU1: if (!primary_pe \|\| primary_pe != pe) free_pending_exception(pe); ... this has no effect. if (primary_pe && !atomic_read(&primary_pe->ref_count)) free_pending_exception(primary_pe); ... sees ref_count == 0 (written by CPU 2), does double free !! This bug can happen only if someone is simultaneously writing to both the origin and the snapshot. If someone is writing only to the origin, __origin_write will submit kcopyd request after it decrements primary_pe->ref_count (so it can't happen that the finished copy races with primary_pe->ref_count decrementation). If someone is writing only to the snapshot, __origin_write isn't invoked at all and the race can't happen. The race happens when someone writes to the snapshot --- this creates pending_exception with primary_pe == NULL and starts copying. Then, someone writes to the same chunk in the snapshot, and __origin_write races with termination of already submitted request in pending_complete (that calls put_pending_exception). This race may be reason for bugs: http://bugzilla.kernel.org/show_bug.cgi?id=11636 https://bugzilla.redhat.com/show_bug.cgi?id=465825 The patch fixes the code to make sure that: 1. If atomic_dec_and_test(&primary_pe->ref_count) returns false, the process must no longer dereference primary_pe (because someone else may free it under us). 2. If atomic_dec_and_test(&primary_pe->ref_count) returns true, the process is responsible for freeing primary_pe. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org	2008-10-21 17:44:51 +01:00
Kazuo Ito	b673c3a819	dm kcopyd: avoid queue shuffle Write throughput to LVM snapshot origin volume is an order of magnitude slower than those to LV without snapshots or snapshot target volumes, especially in the case of sequential writes with O_SYNC on. The following patch originally written by Kevin Jamieson and Jan Blunck and slightly modified for the current RCs by myself tries to improve the performance by modifying the behaviour of kcopyd, so that it pushes back an I/O job to the head of the job queue instead of the tail as process_jobs() currently does when it has to wait for free pages. This way, write requests aren't shuffled to cause extra seeks. I tested the patch against 2.6.27-rc5 and got the following results. The test is a dd command writing to snapshot origin followed by fsync to the file just created/updated. A couple of filesystem benchmarks gave me similar results in case of sequential writes, while random writes didn't suffer much. dd if=/dev/zero of=<somewhere on snapshot origin> bs=4096 count=... [conv=notrunc when updating] 1) linux 2.6.27-rc5 without the patch, write to snapshot origin, average throughput (MB/s) 10M 100M 1000M create,dd 511.46 610.72 11.81 create,dd+fsync 7.10 6.77 8.13 update,dd 431.63 917.41 12.75 update,dd+fsync 7.79 7.43 8.12 compared with write throughput to LV without any snapshots, all dd+fsync and 1000 MiB writes perform very poorly. 10M 100M 1000M create,dd 555.03 608.98 123.29 create,dd+fsync 114.27 72.78 76.65 update,dd 152.34 1267.27 124.04 update,dd+fsync 130.56 77.81 77.84 2) linux 2.6.27-rc5 with the patch, write to snapshot origin, average throughput (MB/s) 10M 100M 1000M create,dd 537.06 589.44 46.21 create,dd+fsync 31.63 29.19 29.23 update,dd 487.59 897.65 37.76 update,dd+fsync 34.12 30.07 26.85 Although still not on par with plain LV performance - cannot be avoided because it's copy on write anyway - this simple patch successfully improves throughtput of dd+fsync while not affecting the rest. Signed-off-by: Jan Blunck <jblunck@suse.de> Signed-off-by: Kazuo Ito <ito.kazuo@oss.ntt.co.jp> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org	2008-10-21 17:44:50 +01:00
Al Viro	9a1c354276	[PATCH] pass fmode_t to blkdev_put() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-21 07:48:58 -04:00
Al Viro	a39907fa2f	[PATCH] switch md Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-21 07:48:31 -04:00
Al Viro	fe5f9f2cd5	[PATCH] switch dm ioctl() doesn't need BKL here Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-21 07:48:29 -04:00
Al Viro	d4430d62fa	[PATCH] beginning of methods conversion To keep the size of changesets sane we split the switch by drivers; to keep the damn thing bisectable we do the following: 1) rename the affected methods, add ones with correct prototypes, make (few) callers handle both. That's this changeset. 2) for each driver convert to new methods. ALL drivers are converted in this series. 3) kill the old (renamed) methods. Note that it _is_ a flagday; all in-tree drivers are converted and by the end of this series no trace of old methods remain. The only reason why we do that this way is to keep the damn thing bisectable and allow per-driver debugging if anything goes wrong. New methods: open(bdev, mode) release(disk, mode) ioctl(bdev, mode, cmd, arg) /* Called without BKL / compat_ioctl(bdev, mode, cmd, arg) locked_ioctl(bdev, mode, cmd, arg) / Called with BKL, legacy */ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-21 07:47:32 -04:00
Al Viro	633a08b812	[PATCH] introduce __blkdev_driver_ioctl() Analog of blkdev_driver_ioctl() with sane arguments. For now uses fake struct file, by the end of the series it won't and blkdev_driver_ioctl() will become a wrapper around it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-21 07:47:26 -04:00
Al Viro	647b3d0084	[PATCH] lose unused arguments in dm ioctl callbacks Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-21 07:47:18 -04:00
Al Viro	aeb5d72706	[PATCH] introduce fmode_t, do annotations Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-21 07:47:06 -04:00
NeilBrown	92850bbd71	md: allow extended partitions on md devices. The new extended partition support provides a much nicer was to have partitions on md devices that the 'mdp' alternate major. We cannot really get rid of 'mdp' at this time, but we can enable extended partitions as that will probably make life easier for sysadmins. Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-21 13:25:32 +11:00
NeilBrown	3c0ee63a64	md: use sysfs_notify_dirent to notify changes to md/dev-xxx/state The 'state' file for a device reports, for example, when the device has failed. Changes should be reported to userspace ASAP without the possibility of blocking on low-memory. sysfs_notify does have that possibility (as it takes a mutex which can be held across a kmalloc) so use sysfs_notify_dirent instead. Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-21 13:25:28 +11:00
NeilBrown	b62b75905d	md: use sysfs_notify_dirent to notify changes to md/array_state Now that we have sysfs_notify_dirent, use it to notify changes to md/array_state. As sysfs_notify_dirent can be called in atomic context, we can remove the delayed notify and the MD_NOTIFY_ARRAY_STATE flag. Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-21 13:25:21 +11:00
Linus Torvalds	ed09441dac	Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (39 commits) [SCSI] sd: fix compile failure with CONFIG_BLK_DEV_INTEGRITY=n libiscsi: fix locking in iscsi_eh_device_reset libiscsi: check reason why we are stopping iscsi session to determine error value [SCSI] iscsi_tcp: return a descriptive error value during connection errors [SCSI] libiscsi: rename host reset to target reset [SCSI] iscsi class: fix endpoint id handling [SCSI] libiscsi: Support drivers initiating session removal [SCSI] libiscsi: fix data corruption when target has to resend data-in packets [SCSI] sd: Switch kernel printing level for DIF messages [SCSI] sd: Correctly handle all combinations of DIF and DIX [SCSI] sd: Always print actual protection_type [SCSI] sd: Issue correct protection operation [SCSI] scsi_error: fix target reset handling [SCSI] lpfc 8.2.8 v2 : Add statistical reporting control and additional fc vendor events [SCSI] lpfc 8.2.8 v2 : Add sysfs control of target queue depth handling [SCSI] lpfc 8.2.8 v2 : Revert target busy in favor of transport disrupted [SCSI] scsi_dh_alua: remove REQ_NOMERGE [SCSI] lpfc 8.2.8 : update driver version to 8.2.8 [SCSI] lpfc 8.2.8 : Add MSI-X support [SCSI] lpfc 8.2.8 : Update driver to use new Host byte error code DID_TRANSPORT_DISRUPTED ...	2008-10-17 09:00:23 -07:00
Linus Torvalds	c472273f86	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: md: fix input truncation in safe_delay_store() md: check for memory allocation failure in faulty personality md: build failure due to missing delay.h md: Relax minimum size restrictions on chunk_size. md: remove space after function name in declaration and call. md: Remove unnecessary #includes, #defines, and function declarations. md: Convert remaining 1k representations in linear.c to sectors. md: linear.c: Make two local variables sector-based. md: linear: Represent dev_info->size and dev_info->offset in sectors. md: linear.c: Remove broken debug code. md: linear.c: Remove pointless initialization of curr_offset. md: linear.c: Fix typo in comment. md: Don't try to set an array to 'read-auto' if it is already in that state. md: Allow metadata_version to be updated for externally managed metadata. md: Fix rdev_size_store with size == 0	2008-10-16 11:55:11 -07:00
Dan Williams	97ce0a7f9c	md: fix input truncation in safe_delay_store() safe_delay_store() currently truncates the last character of input since it tells strlcpy that the buffer can only hold 'len' characters, off by one. sysfs already null terminates the buffer, so just increase the last argument to strlcpy. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-16 17:03:08 +11:00
Sven Wegener	08ff39f1c8	md: check for memory allocation failure in faulty personality It's a fault injection module, but I don't think we should oops here. Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Neil Brown <neilb@suse.de>	2008-10-16 14:16:53 +11:00
Stephen Rothwell	255707274e	md: build failure due to missing delay.h Today's linux-next build (powerpc ppc64_defconfig) failed like this: drivers/md/raid1.c: In function 'sync_request': drivers/md/raid1.c:1759: error: implicit declaration of function 'msleep_interruptible' make[3]: * [drivers/md/raid1.o] Error 1 make[3]: * Waiting for unfinished jobs.... drivers/md/raid10.c: In function 'sync_request': drivers/md/raid10.c:1749: error: implicit declaration of function 'msleep_interruptible' make[3]: *** [drivers/md/raid10.o] Error 1 drivers/md/md.c: In function 'md_do_sync': drivers/md/md.c:5915: error: implicit declaration of function 'msleep' Caused by commit 6caa3b0bbdb474647f6bdd8a958ffc46f78d8d58 ("md: Remove unnecessary #includes, #defines, and function declarations"). I added the following patch. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-15 21:57:05 +11:00
Mike Christie	6000a368cd	[SCSI] block: separate failfast into multiple bits. Multipath is best at handling transport errors. If it gets a device error then there is not much the multipath layer can do. It will just access the same device but from a different path. This patch breaks up failfast into device, transport and driver errors. The multipath layers (md and dm mutlipath) only ask the lower levels to fast fail transport errors. The user of failfast, read ahead, will ask to fast fail on all errors. Note that blk_noretry_request will return true if any failfast bit is set. This allows drivers that do not support the multipath failfast bits to continue to fail on any failfast error like before. Drivers like scsi that are able to fail fast specific errors can check for the specific fail fast type. In the next patch I will convert scsi. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-10-13 09:28:52 -04:00
NeilBrown	4bbf3771ca	md: Relax minimum size restrictions on chunk_size. Currently, the 'chunk_size' of an array must be at-least PAGE_SIZE. This makes moving an array to a machine with a larger PAGE_SIZE, or changing the kernel to use a larger PAGE_SIZE, can stop an array from working. For RAID10 and RAID4/5/6, this is non-trivial to fix as the resync process works on whole pages at a time, and assumes them to be wholly within a stripe. For other raid personalities, this restriction is not needed at all and can be dropped. So remove the test on chunk_size from common can, and add it in just the places where it is needed: raid10 and raid4/5/6. Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:12 +11:00
NeilBrown	d710e13812	md: remove space after function name in declaration and call. Having function (args) instead of function(args) make is harder to search for calls of particular functions. So remove all those spaces. Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:12 +11:00
NeilBrown	fb4d8c76e5	md: Remove unnecessary #includes, #defines, and function declarations. A lot of cruft has gathered over the years. Time to remove it. Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:12 +11:00
Andre Noll	ab5bd5cbc8	md: Convert remaining 1k representations in linear.c to sectors. This patch renames hash_spacing and preshift to spacing and sector_shift respectively with the following change of semantics: Case 1: (sizeof(sector_t) <= sizeof(u32)). ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In this case, we have sector_shift = preshift = 0 and spacing = 2 * hash_spacing. Hence, the index for the hash table which is computed by the new code in which_dev() as sector / spacing equals the old value which was (sector/2) / hash_spacing. Note also that the value of nb_zone stays the same because both sz and base double. Case 2: (sizeof(sector_t) > sizeof(u32)). ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ (aka the shifting dance case). Here we have sector_shift = preshift + 1 and spacing = 2 * hash_spacing during the computation of nb_zone and curr_sector, but spacing = hash_spacing in which_dev() because in the last hunk of the patch for linear.c we shift down conf->spacing (= 2 * hash_spacing) by one more bit than in the old code. Hence in the computation of nb_zone, sz and base have the same value as before, so nb_zone is not affected. Also curr_sector in the next hunk stays the same. In which_dev() the hash table index is computed as (sector >> sector_shift) / spacing In view of sector_shift = preshift + 1 and spacing = hash_spacing, this equals ((sector/2) >> preshift) / hash_spacing which is the value computed by the old code. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:12 +11:00
Andre Noll	23242fbb47	md: linear.c: Make two local variables sector-based. This is a preparation for representing also the remaining fields of struct linear_private_data as sectors. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:12 +11:00
Andre Noll	6283815d18	md: linear: Represent dev_info->size and dev_info->offset in sectors. Rename them to num_sectors and start_sector which is more descriptive. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:12 +11:00
Andre Noll	451708d2a4	md: linear.c: Remove broken debug code. conf->smallest_size is undefined since day one of the git repo.. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:12 +11:00
Andre Noll	481d86c7eb	md: linear.c: Remove pointless initialization of curr_offset. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:12 +11:00
Andre Noll	e61130228e	md: linear.c: Fix typo in comment. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:12 +11:00
NeilBrown	80268ee927	md: Don't try to set an array to 'read-auto' if it is already in that state. 'read-auto' is a variant of 'readonly' which will switch to writable on the first write attempt. Calling do_md_stop to set the array readonly when it is already readonly returns an error. So make sure not to do that. Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:12 +11:00
NeilBrown	ea43ddd849	md: Allow metadata_version to be updated for externally managed metadata. For externally managed metadata, the 'metadata_version' sysfs attribute is really just a channel for user-space programs to communicate about how the array is being managed. It can be useful for this to be changed while the array is active. Normally changes to metadata_version are not permitted while the array is active. Change that so that if the metadata is externally managed, the metadata_version can be changed to a different flavour of external management. Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:11 +11:00
Chris Webb	7d3c6f8717	md: Fix rdev_size_store with size == 0 Fix rdev_size_store with size == 0. size == 0 means to use the largest size allowed by the underlying device and is used when modifying an active array. This fixes a regression introduced by commit `d7027458d6` Cc: <stable@kernel.org> Signed-off-by: Chris Webb <chris@arachsys.com> Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:11 +11:00
Alan Jenkins	ce52aebd02	raid, fastboot: hide RAID autodetect option if MD is compiled as a module Signed-off-by: Alan Jenkins <alan-jenkins@tuffmail.co.uk> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-12 08:25:14 -07:00
Arjan van de Ven	a364092a41	raid: make RAID autodetect default a KConfig option RAID autodetect has the side effect of requiring synchronisation of all device drivers, which can make the boot several seconds longer (I've measured 7 on one of my laptops).... even for systems that don't have RAID setup for the root filesystem (the only FS where this matters). This patch makes the default for autodetect a config option; either way the user can always override via the kernel command line. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: NeilBrown <neilb@suse.de>	2008-10-12 08:25:02 -07:00
Linus Torvalds	b0af205afb	Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm * git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: dm: detect lost queue dm: publish dm_vcalloc dm: publish dm_table_unplug_all dm: publish dm_get_mapinfo dm: export struct dm_dev dm crypt: avoid unnecessary wait when splitting bio dm crypt: tidy ctx pending dm crypt: fix async inc_pending dm crypt: move dec_pending on error into write_io_submit dm crypt: remove inc_pending from write_io_submit dm crypt: tidy write loop pending dm crypt: tidy crypt alloc dm crypt: tidy inc pending dm exception store: use chunk_t for_areas dm exception store: introduce area_location function dm raid1: kcopyd should stop on error if errors handled dm mpath: remove is_active from struct dm_path dm mpath: use more error codes Fixed up trivial conflict in drivers/md/dm-mpath.c manually.	2008-10-10 11:11:47 -07:00
Alasdair G Kergon	0c2322e4ce	dm: detect lost queue Detect and report buggy drivers that destroy their request_queue. Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Stefan Raspl <raspl@linux.vnet.ibm.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org>	2008-10-10 13:37:13 +01:00
Mikulas Patocka	5416090426	dm: publish dm_vcalloc Publish dm_vcalloc in include/linux/device-mapper.h because this function is used by targets. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:12 +01:00
Mikulas Patocka	ea0ec64094	dm: publish dm_table_unplug_all Publish dm_table_unplug_all in include/linux/device-mapper.h because this function is used by targets. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:11 +01:00
Mikulas Patocka	89343da077	dm: publish dm_get_mapinfo Publish dm_get_mapinfo in include/linux/device-mapper.h because this function is used by targets. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:10 +01:00
Mikulas Patocka	82b1519b34	dm: export struct dm_dev Split struct dm_dev in two and publish the part that other targets need in include/linux/device-mapper.h. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:09 +01:00
Milan Broz	933f01d433	dm crypt: avoid unnecessary wait when splitting bio Don't wait between submitting crypt requests for a bio unless we are short of memory. There are two situations when we must split an encrypted bio: 1) there are no free pages; 2) the new bio would violate underlying device restrictions (e.g. max hw segments). In case (2) we do not need to wait. Add output variable to crypt_alloc_buffer() to distinguish between these cases. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:08 +01:00
Milan Broz	c8081618a9	dm crypt: tidy ctx pending Move the initialisation of ctx->pending into one place, at the start of crypt_convert(). Introduce crypt_finished to indicate whether or not the encryption is finished, for use in a later patch. No functional change. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:08 +01:00
Milan Broz	4e59409891	dm crypt: fix async inc_pending The pending reference count must be incremented before the async work is queued to another thread, not after. Otherwise there's a race if the work completes and decrements the reference count before it gets incremented. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:07 +01:00
Milan Broz	6c031f41db	dm crypt: move dec_pending on error into write_io_submit Make kcryptd_crypt_write_io_submit() responsible for decrementing the pending count after an error. Also fixes a bug in the async path that forgot to decrement it. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:06 +01:00
Alasdair G Kergon	1e37bb8e55	dm crypt: remove inc_pending from write_io_submit Make the caller reponsible for incrementing the pending count before calling kcryptd_crypt_write_io_submit() in the non-async case to bring it into line with the async case. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:05 +01:00
Milan Broz	fc5a5e9aa8	dm crypt: tidy write loop pending Move kcryptd_crypt_write_convert_loop inside kcryptd_crypt_write_convert. This change is needed for a later patch. No functional change. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:04 +01:00
Milan Broz	dc440d1e56	dm crypt: tidy crypt alloc Factor out crypt io allocation code. Later patches will call it from another place. No functional change. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:03 +01:00
Milan Broz	3e1a8bdd05	dm crypt: tidy inc pending Move io pending to one place. No functional change, usefull to simplify debugging. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:02 +01:00
Mikulas Patocka	fd14acf6fc	dm exception store: use chunk_t for_areas Change uint32_t into chunk_t to remove 32-bit limitation on the number of chunks on systems with 64-bit sector numbers. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:01 +01:00
Mikulas Patocka	a481db7846	dm exception store: introduce area_location function Move this logic to a function, because it will be reused later. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:00 +01:00
Jonathan Brassow	f7c83e2e47	dm raid1: kcopyd should stop on error if errors handled dm-raid1 is setting the 'DM_KCOPYD_IGNORE_ERROR' flag unconditionally when assigning kcopyd work. kcopyd is responsible for copying an assigned section of disk to one or more other disks. The 'DM_KCOPYD_IGNORE_ERROR' flag affects kcopyd in the following way: When not set: kcopyd will immediately stop the copy operation when an error is encountered. When set: kcopyd will try to proceed regardless of errors and try to continue copying any remaining amount. Since dm-raid1 tracks regions of the address space that are (or are not) in sync and it now has the ability to handle these errors, we can safely enable this optimization. This optimization is conditional on whether mirror error handling has been enabled. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:36:59 +01:00
Kiyoshi Ueda	6680073d3e	dm mpath: remove is_active from struct dm_path This patch moves 'is_active' from struct dm_path to struct pgpath as it does not need exporting. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:36:58 +01:00
Benjamin Marzinski	01460f3520	dm mpath: use more error codes This patch allows path errors from the multipath ctr function to propagate up to userspace as errno values from the ioctl() call. This is in response to https://www.redhat.com/archives/dm-devel/2008-May/msg00000.html and https://bugzilla.redhat.com/show_bug.cgi?id=444421 The patch only lets through the errors that it needs to in order to get the path errors from parse_path(). Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:36:57 +01:00
Denis ChengRq	6feef531f5	block: mark bio_split_pool static Since all bio_split calls refer the same single bio_split_pool, the bio_split function can use bio_split_pool directly instead of the mempool_t parameter; then the mempool_t parameter can be removed from bio_split param list, and bio_split_pool is only referred in fs/bio.c file, can be marked static. Signed-off-by: Denis ChengRq <crquan@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:57:05 +02:00
Mike Anderson	224cb3e981	dm: Call blk_abort_queue on failed paths Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:14 +02:00
Tejun Heo	074a7aca7a	block: move stats from disk to part0 Move stats related fields - stamp, in_flight, dkstats - from disk to part0 and unify stat handling such that... * part_stat_() now updates part0 together if the specified partition is not part0. ie. part_stat_() are now essentially all_stat_(). {disk\|all}_stat_() are gone. part_round_stats() is updated similary. It handles part0 stats automatically and disk_round_stats() is killed. * part_{inc\|dec}_in_fligh() is implemented which automatically updates part0 stats for parts other than part0. * disk_map_sector_rcu() is updated to return part0 if no part matches. Combined with the above changes, this makes NULL special case handling in callers unnecessary. * Separate stats show code paths for disk are collapsed into part stats show code paths. * Rename disk_stat_lock/unlock() to part_stat_lock/unlock() While at it, reposition stat handling macros a bit and add missing parentheses around macro parameters. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	0762b8bde9	block: always set bdev->bd_part Till now, bdev->bd_part is set only if the bdev was for parts other than part0. This patch makes bdev->bd_part always set so that code paths don't have to differenciate common handling. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	b7db9956e5	block: move policy from disk to part0 Move disk->policy to part0->policy. Implement and use get_disk_ro(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:07 +02:00
Tejun Heo	ed9e198234	block: implement and use {disk\|part}_to_dev() Implement {disk\|part}_to_dev() and use them to access generic device instead of directly dereferencing {disk\|part}->dev. To make sure no user is left behind, rename generic devices fields to __dev. This is in preparation of unifying partition 0 handling with other partitions. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:07 +02:00
Tejun Heo	c995905916	block: fix diskstats access There are two variants of stat functions - ones prefixed with double underbars which don't care about preemption and ones without which disable preemption before manipulating per-cpu counters. It's unclear whether the underbarred ones assume that preemtion is disabled on entry as some callers don't do that. This patch unifies diskstats access by implementing disk_stat_lock() and disk_stat_unlock() which take care of both RCU (for partition access) and preemption (for per-cpu counter access). diskstats access should always be enclosed between the two functions. As such, there's no need for the versions which disables preemption. They're removed and double underbars ones are renamed to drop the underbars. As an extra argument is added, there's no danger of using the old version unconverted. disk_stat_lock() uses get_cpu() and returns the cpu index and all diskstat functions which access per-cpu counters now has @cpu argument to help RT. This change adds RCU or preemption operations at some places but also collapses several preemption ops into one at others. Overall, the performance difference should be negligible as all involved ops are very lightweight per-cpu ones. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:06 +02:00
Tejun Heo	f331c0296f	block: don't depend on consecutive minor space * Implement disk_devt() and part_devt() and use them to directly access devt instead of computing it from ->major and ->first_minor. Note that all references to ->major and ->first_minor outside of block layer is used to determine devt of the disk (the part0) and as ->major and ->first_minor will continue to represent devt for the disk, converting these users aren't strictly necessary. However, convert them for consistency. * Implement disk_max_parts() to avoid directly deferencing genhd->minors. * Update bdget_disk() such that it doesn't assume consecutive minor space. * Move devt computation from register_disk() to add_disk() and make it the only one (all other usages use the initially determined value). These changes clean up the code and will help disk->part dereference fix and extended block device numbers. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:05 +02:00
Jens Axboe	5b99c2ffa9	block: make bi_phys_segments an unsigned int instead of short raid5 can overflow with more than 255 stripes, and we can increase it to an int for free on both 32 and 64-bit archs due to the padding. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:03 +02:00
Jens Axboe	960e739d9e	block: raid fixups for removal of bi_hw_segments Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:03 +02:00
Mikulas Patocka	5df97b91b5	drop vmerge accounting Remove hw_segments field from struct bio and struct request. Without virtual merge accounting they have no purpose. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:03 +02:00
Chandra Seetharaman	7253a33434	dm mpath: add missing path switching locking Moving the path activation to workqueue along with scsi_dh patches introduced a race. It is due to the fact that the current_pgpath (in the multipath data structure) can be modified if changes happen in any of the paths leading to the lun. If the changes lead to current_pgpath being set to NULL, then it leads to the invalid access which results in the panic below. This patch fixes that by storing the pgpath to activate in the multipath data structure and properly protecting it. Note that if activate_path is called twice in succession with different pgpath, with the second one being called before the first one is done, then activate path will be called twice for the second pgpath, which is fine. Unable to handle kernel paging request for data at address 0x00000020 Faulting instruction address: 0xd000000000aa1844 cpu 0x1: Vector: 300 (Data Access) at [c00000006b987a80] pc: d000000000aa1844: .activate_path+0x30/0x218 [dm_multipath] lr: c000000000087a2c: .run_workqueue+0x114/0x204 sp: c00000006b987d00 msr: 8000000000009032 dar: 20 dsisr: 40000000 current = 0xc0000000676bb3f0 paca = 0xc0000000006f3680 pid = 2528, comm = kmpath_handlerd enter ? for help [c00000006b987da0] c000000000087a2c .run_workqueue+0x114/0x204 [c00000006b987e40] c000000000088b58 .worker_thread+0x120/0x144 [c00000006b987f00] c00000000008ca70 .kthread+0x78/0xc4 [c00000006b987f90] c000000000027cc8 .kernel_thread+0x4c/0x68 Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-01 14:39:27 +01:00
Mikulas Patocka	b01cd5ac43	dm: cope with access beyond end of device in dm_merge_bvec If for any reason dm_merge_bvec() is given an offset beyond the end of the device, avoid an oops and always allow one page to be added to an empty bio. We'll reject the I/O later after the bio is submitted. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-01 14:39:24 +01:00
Mikulas Patocka	5037108acd	dm: always allow one page in dm_merge_bvec Some callers assume they can always add at least one page to an empty bio, so dm_merge_bvec should not return 0 in this case: we'll reject the I/O later after the bio is submitted. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-01 14:39:17 +01:00
NeilBrown	9744197c3d	md: Don't wait UNINTERRUPTIBLE for other resync to finish When two md arrays share some block device (e.g each uses different partitions on the one device), a resync of one array will wait for the resync on the other to finish. This can be a long time and as it currently waits TASK_UNINTERRUPTIBLE, the softlockup code notices and complains. So use TASK_INTERRUPTIBLE instead and make sure to flush signals before calling schedule. Signed-off-by: NeilBrown <neilb@suse.de>	2008-09-19 11:49:54 +10:00
NeilBrown	b2d2c4cead	Fix problem with waiting while holding rcu read lock in md/bitmap.c A recent patch to protect the rdev list with rcu locking leaves us with a problem because we can sleep on memalloc while holding the rcu lock. The rcu lock is only needed while walking the linked list as uninteresting devices (failed or spares) can be removed at any time. So only take the rcu lock while actually walking the linked list. Take a refcount on the rdev during the time when we drop the lock and do the memalloc to start IO. When we return to the locked code, all the interesting devices on the list will not have moved, so we can simply use list_for_each_continue_rcu to pick up where we left off. Signed-off-by: NeilBrown <neilb@suse.de>	2008-09-01 12:48:13 +10:00
NeilBrown	271f5a9b8f	Remove invalidate_partition call from do_md_stop. When stopping an md array, or just switching to read-only, we currently call invalidate_partition while holding the mddev lock. The main reason for this is probably to ensure all dirty buffers are flushed (invalidate_partition calls fsync_bdev). However if any dirty buffers are found, it will almost certainly cause a deadlock as starting writeout will require an update to the superblock, and performing that updates requires taking the mddev lock - which is already held. This deadlock can be demonstrated by running "reboot -f -n" with a root filesystem on md/raid, and some dirty buffers in memory. All other calls to stop an array should already happen after a flush. The normal sequence is to stop using the array (e.g. umount) which will cause __blkdev_put to call sync_blockdev. Then open the array and issue the STOP_ARRAY ioctl while the buffers are all still clean. So this invalidate_partition is normally a no-op, except for one case where it will cause a deadlock. So remove it. This patch possibly addresses the regression recored in http://bugzilla.kernel.org/show_bug.cgi?id=11460 and http://bugzilla.kernel.org/show_bug.cgi?id=11452 though it isn't yet clear how it ever worked. Signed-off-by: NeilBrown <neilb@suse.de>	2008-09-01 12:32:52 +10:00
Dan Williams	56ac36d722	md: cancel check/repair requests when recovery is needed If a 'repair' is requested when an array is in a position to 'recover' raid1 will perform the repair while md believes a recovery is happening. Address this at both ends, i.e. cancel check/repair requests upon detecting a recover condition and do not call ->spare_active after completing a check/repair. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2008-08-07 10:02:47 -07:00
NeilBrown	0310fa216d	Allow raid10 resync to happening in larger chunks. The raid10 resync/recovery code currently limits the amount of in-flight resync IO to 2Meg. This was copied from raid1 where it seems quite adequate. However for raid10, some layouts require a bit of seeking to perform a resync, and allowing a larger buffer size means that the seeking can be significantly reduced. There is probably no real need to limit the amount of in-flight IO at all. Any shortage of memory will naturally reduce the amount of buffer space available down to a set minimum, and any concurrent normal IO will quickly cause resync IO to back off. The only problem would be that normal IO has to wait for all resync IO to finish, so a very large amount of resync IO could cause unpleasant latency when normal IO starts up. So: increase RESYNC_DEPTH to allow 32Meg of buffer (if memory is available) which seems to be a good amount. Also reduce the amount of memory reserved as there is no need to keep 2Meg just for resync if memory is tight. Thanks to Keld for the suggestion. Cc: Keld Jørn Simonsen <keld@dkuug.dk> Signed-off-by: NeilBrown <neilb@suse.de>	2008-08-05 15:56:32 +10:00
NeilBrown	c89a8eee61	Allow faulty devices to be removed from a readonly array. Removing faulty devices from an array is a two stage process. First the device is moved from being a part of the active array to being similar to a spare device. Then it can be removed by a request from user space. The first step is currently not performed for read-only arrays, so the second step can never succeed. So allow readonly arrays to remove failed devices (which aren't blocked). Signed-off-by: NeilBrown <neilb@suse.de>	2008-08-05 15:56:32 +10:00
NeilBrown	ac4090d24c	Don't let a blocked_rdev interfere with read request in raid5/6 When we have externally managed metadata, we need to mark a failed device as 'Blocked' and not allow any writes until that device have been marked as faulty in the metadata and the Blocked flag has been removed. However it is perfectly OK to allow read requests when there is a Blocked device, and with a readonly array, there may not be any metadata-handler watching for blocked devices. So in raid5/raid6 only allow a Blocked device to interfere with Write request or resync. Read requests go through untouched. raid1 and raid10 already differentiate between read and write properly. Signed-off-by: NeilBrown <neilb@suse.de>	2008-08-05 15:56:32 +10:00
NeilBrown	dba034eef2	Fail safely when trying to grow an array with a write-intent bitmap. We cannot currently change the size of a write-intent bitmap. So if we change the size of an array which has such a bitmap, it tries to set bits beyond the end of the bitmap. For now, simply reject any request to change the size of an array which has a bitmap. mdadm can remove the bitmap and add a new one after the array has changed size. Signed-off-by: NeilBrown <neilb@suse.de>	2008-08-05 15:56:32 +10:00
NeilBrown	2b25000bf5	Restore force switch of md array to readonly at reboot time. A recent patch allowed do_md_stop to know whether it was being called via an ioctl or not, and thus where to allow for an extra open file descriptor when checking if it is in use. This broke then switch to readonly performed by the shutdown notifier, which needs to work even when the array is still (apparently) active (as md doesn't get told when the filesystem becomes readonly). So restore this feature by pretending that there can be lots of file descriptors open, but we still want do_md_stop to switch to readonly. Signed-off-by: NeilBrown <neilb@suse.de>	2008-08-05 15:56:31 +10:00
NeilBrown	19052c0e85	Make writes to md/safe_mode_delay immediately effective. If we reduce the 'safe_mode_delay', it could still wait for the old delay to completely expire before doing anything about safe_mode. Thus the effect if the change is delayed. To make the effect more immediate, run the timeout function immediately if the delay was reduced. This may cause it to run slightly earlier that required, but that is the safer option. Signed-off-by: NeilBrown <neilb@suse.de>	2008-08-05 15:56:31 +10:00
Linus Torvalds	1e24b15b26	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: md: raid10: wake up frozen array md: do not count blocked devices as spares md: do not progress the resync process if the stripe was blocked md: delay notification of 'active_idle' to the recovery thread md: fix merge error md: move async_tx_issue_pending_all outside spin_lock_irq	2008-08-01 11:56:07 -07:00
Linus Torvalds	b17b3d479c	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: md: the bitmap code needs to use blk_plug_device_unlocked() block: add a blk_plug_device_unlocked() that grabs the queue lock	2008-08-01 11:46:00 -07:00
Jens Axboe	93769f5807	md: the bitmap code needs to use blk_plug_device_unlocked() It doesn't hold the queue lock, so it's both racey on the queue flags and thus spews a warning. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-08-01 20:32:31 +02:00
Al Viro	d5686b444f	[PATCH] switch mtd and dm-table to lookup_bdev() No need to open-code it... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-08-01 11:25:31 -04:00
Arthur Jones	388667bed5	md: raid10: wake up frozen array When rescheduling a bio in raid10, we wake up the md thread, but if the array is frozen, this will have no effect. This causes the array to remain frozen for eternity. We add a wake_up to allow the array to de-freeze. This code is nearly identical to the raid1 code, which has this fix already. Signed-off-by: Arthur Jones <ajones@riverbed.com> Signed-off-by: NeilBrown <neilb@suse.de>	2008-08-01 12:55:14 +10:00
Dan Williams	e542713529	md: do not count blocked devices as spares remove_and_add_spares() assumes that failed devices have been hot-removed from the array. Removal is skipped in the 'blocked' case so do not count a device in this state as 'spare'. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2008-07-28 17:52:44 -07:00
Dan Williams	df10cfbc4d	md: do not progress the resync process if the stripe was blocked handle_stripe will take no action on a stripe when waiting for userspace to unblock the array, so do not report completed sectors. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2008-07-28 17:52:37 -07:00

... 3 4 5 6 7 ...

1383 Commits