Commit Graph

2690 Commits

Author SHA1 Message Date
Shaohua Li 32f9f570d0 MD: ignore discard request for hard disks of hybid raid1/raid10 array
In SSD/hard disk hybid storage, discard request should be ignored for hard
disk. We used to be doing this way, but the unplug path forgets it.

This is suitable for stable tree since v3.6.

Cc: stable@vger.kernel.org
Reported-and-tested-by: Markus <M4rkusXXL@web.de>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-30 14:49:36 +10:00
NeilBrown 486adf72cc md: bad block list should default to disabled.
Maintenance of a bad-block-list currently defaults to 'enabled'
and is then disabled when it cannot be supported.
This is backwards and causes problem for dm-raid which didn't know
to disable it.

So fix the defaults, and only enabled for v1.x metadata which
explicitly has bad blocks enabled.

The problem with dm-raid has been present since badblock support was
added in v3.1, so this patch is suitable for any -stable from 3.1
onwards.

Cc: stable@vger.kernel.org (3.1+)
Reported-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-30 14:49:32 +10:00
Hirokazu Takahashi 0fea7ed82b md: raid1/raid10 md devices leak memory when stopping
Hi.

Raid1 and raid10 devices leak memory every time they stop.
This is a patch for linux-3.9.0-rc7 to fix this problem.

Thanks,
Hirokazu Takahashi.

Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-30 14:49:26 +10:00
Jonathan Brassow be83651f00 DM RAID: Add message/status support for changing sync action
DM RAID:  Add message/status support for changing sync action

This patch adds a message interface to dm-raid to allow the user to more
finely control the sync actions being performed by the MD driver.  This
gives the user the ability to initiate "check" and "repair" (i.e. scrubbing).
Two additional fields have been appended to the status output to provide more
information about the type of sync action occurring and the results of those
actions, specifically: <sync_action> and <mismatch_cnt>.  These new fields
will always be populated.  This is essentially the device-mapper way of doing
what MD controls through the 'sync_action' sysfs file and shows through the
'mismatch_cnt' sysfs file.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 11:42:43 +10:00
Jonathan Brassow a91d5ac048 MD: Export 'md_reap_sync_thread' function
MD: Export 'md_reap_sync_thread' function

Make 'md_reap_sync_thread' available to other files, specifically dm-raid.c.
- rename reap_sync_thread to md_reap_sync_thread
- move the fn after md_check_recovery to match md.h declaration placement
- export md_reap_sync_thread

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 11:42:43 +10:00
NeilBrown b6d428c669 md: don't update metadata when stopping a read-only array.
read-only arrays should stay that way as much as possible.
Updating the metadata - which could be triggered by a re-add
while assembling the array metadata - should be avoided.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 11:42:42 +10:00
NeilBrown 7ceb17e87b md: Allow devices to be re-added to a read-only array.
When assembling an array incrementally we might want to make
it device available when "enough" devices are present, but maybe
not "all" devices are present.
If the remaining devices appear before the array is actually used,
they should be added transparently.

We do this by using the "read-auto" mode where the array acts like
it is read-only until a write request arrives.

Current an add-device request switches a read-auto array to active.
This means that only one device can be added after the array is first
made read-auto.  This isn't a problem for RAID5, but is not ideal for
RAID6 or RAID10.
Also we don't really want to switch the array to read-auto at all
when re-adding a device as this doesn't really imply any change.

So:
 - remove the "md_update_sb()" call from add_new_disk().  This isn't
   really needed as just adding a disk doesn't require a metadata
   update.  Instead, just set MD_CHANGE_DEVS.  This will effect a
   metadata update soon enough, once the array is not read-only.

 - Allow the ADD_NEW_DISK ioctl to succeed without activating a
   read-auto array, providing the MD_DISK_SYNC flag is set.
   In this case, the device will be rejected if it cannot be added
   with the correct device number, or has an incorrect event count.

 - Teach remove_and_add_spares() to be careful about adding spares
   when the array is read-only (or read-mostly) - only add devices
   that are thought to be in-sync, and only do it if the array is
   in-sync itself.

 - In md_check_recovery, use remove_and_add_spares in the read-only
   case, rather than open coding just the 'remove' part of it.

Reported-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 11:42:42 +10:00
Martin Wilck 7e83ccbecd md/raid10: Allow skipping recovery when clean arrays are assembled
When an array is assembled incrementally with mdadm -I -R
and the array switches to "active" mode, md starts a recovery.

If the array was clean, the "fullsync" flag will be 0. Skip
the full recovery in this case, as RAID1 does (the code was
actually copied from the sync_request() method of RAID1).

Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 11:42:42 +10:00
NeilBrown c0b32972fb md/raid5: avoid an extra write when writing to a known-bad-block.
If we write to a known-bad-block it will be flags as having
a ReadError by analyse_stripe, but the write will proceed anyway
(as it should).  Then the read-error handling will kick in an
write again, then re-read.

We don't need that 'write-again', so set R5_ReWrite so it looks like
it has already been done.  Then we will just get the re-read, which we
want.

Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 11:42:42 +10:00
majianpeng 6f608040ce md/raid5: Change or of some order to improve efficiency.
As the function call is the most expensive of these tests it should be
done later in the chain so that it can be avoided in some cases.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 11:42:41 +10:00
Akinobu Mita 3f810b6c4a md: use set_bit_le and clear_bit_le
The value returned by test_and_set_bit_le() drivers/md/bitmap.c is not used.
So just use set_bit_le(). The same goes for test_and_clear_bit_le().

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 11:42:41 +10:00
NeilBrown 3ea8929da3 md: HOT_DISK_REMOVE shouldn't make a read-auto device active.
If a fail device or a spare is removed from an array, there is
not need to make the array 'active'.  If/when the array does become
active for some other reason the metadata will be update to reflect
the removal.
If that never happens and the array is stopped while still read-auto,
then there is no loss in forgetting the that the device had 'failed'.

A read-only array will leave failed devices attached to
the array personality, so we need to explicitly call
remove_and_add_spares() to free it (clearing Blocked just
like we do in store_slot()).

Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 11:42:41 +10:00
NeilBrown 746d3207ae md: use common code for all calls to ->hot_remove_disk()
slot_store and remove_and_add_spares both call ->hot_remove_disk(),
but with slightly different tests and consequences, which is
at least untidy and might be buggy.

So modify remove_and_add_spaces() so that it can be asked
to remove a specific device, and call it from slot_store().

We also clear the Blocked flag to ensure that doesn't prevent
removal.  The purpose of Blocked is to prevent automatic removal
by the kernel before an error is acknowledged.
If the array is read/write then user-space would have not reason
to remove a device unless it was known to be 'spare' or 'faulty' in
which it would have already cleared the Blocked flag.
If the array is read-only, the flag might still be blocked, but
there is no harm in clearing the flag for read-only arrays.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 11:42:41 +10:00
NeilBrown d87f064f58 md: never update metadata when array is read-only.
Normally we don't even try to update the metadata if
the array is read-only.  However future patches
will increase the number of things that can happen on a read-only
array, so it is safest to explicitly disable this.

Every time that mddev->ro is set to 0, either
 - md_update_sb will be called again (at least if MD_CHANGE_DEVS
   is set) or
 - the mddev->thread is scheduled, which will also run
   md_update_sb if needed.

So this is safe: if the array ever become read-write the
metadata will be updated.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24 11:42:40 +10:00
Linus Torvalds 0a82a8d132 Revert "block: add missing block_bio_complete() tracepoint"
This reverts commit 3a366e614d.

Wanlong Gao reports that it causes a kernel panic on his machine several
minutes after boot. Reverting it removes the panic.

Jens says:
 "It's not quite clear why that is yet, so I think we should just revert
  the commit for 3.9 final (which I'm assuming is pretty close).

  The wifi is crap at the LSF hotel, so sending this email instead of
  queueing up a revert and pull request."

Reported-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Requested-by: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-18 09:00:26 -07:00
Mike Snitzer 19b0092e26 dm cache: reduce bio front_pad size in writeback mode
A recent patch to fix the dm cache target's writethrough mode extended
the bio's front_pad to include a 1056-byte struct dm_bio_details.
Writeback mode doesn't need this, so this patch reduces the
per_bio_data_size to 16 bytes in this case instead of 1096.

The dm_bio_details structure was added in "dm cache: fix writes to
cache device in writethrough mode" which fixed commit e2e74d617e ("dm
cache: fix race in writethrough implementation").  In writeback mode
we avoid allocating the writethrough-specific members of the
per_bio_data structure (the dm_bio_details structure included).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-04-05 15:36:34 +01:00
Darrick J. Wong b844fe6918 dm cache: fix writes to cache device in writethrough mode
The dm-cache writethrough strategy introduced by commit e2e74d617e
("dm cache: fix race in writethrough implementation") issues a bio to
the origin device, remaps and then issues the bio to the cache device.
This more conservative in-series approach was selected to favor
correctness over performance (of the previous parallel writethrough).
However, this in-series implementation that reuses the same bio to write
both the origin and cache device didn't take into account that the block
layer's req_bio_endio() modifies a completing bio's bi_sector and
bi_size.  So the new writethrough strategy needs to preserve these bio
fields, and restore them before submission to the cache device,
otherwise nothing gets written to the cache (because bi_size is 0).

This patch adds a struct dm_bio_details field to struct per_bio_data,
and uses dm_bio_record() and dm_bio_restore() to ensure the bio is
restored before reissuing to the cache device.  Adding such a large
structure to the per_bio_data is not ideal but we can improve this
later, for now correctness is the important thing.

This problem initially went unnoticed because the dm-cache test-suite
uses a linear DM device for the dm-cache device's origin device.
Writethrough worked as expected because DM submits a *clone* of the
original bio, so the original bio which was reused for the cache was
never touched.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-04-05 15:36:32 +01:00
Linus Torvalds 22c3f2fff6 A few bugfixes for md
- recent regressions in raid5
  - recent regressions in dmraid
  - a few instances of CONFIG_MULTICORE_RAID456 linger
 
 Several tagged for -stable
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIVAwUAUUzCwDnsnt1WYoG5AQJKMhAAsi2XhqLC4Dx19J8MTF6+cjfynWCxF2SC
 3mMcVZm6yxSowixb1Ht72CyssWdJAi4vgaw0aLNH7b3CbPDZfTSfqLP4tSvyPfod
 aDcFDdd/RhHjDpJqZ52Tyc6QzBfyhwu+s9R+a78TSL47ZMjZpz1QpshG8Sm9JYTs
 z72VlIZeglzhWmzO1FInsL/oT/Hwr9IfpmJpuXBQQObDn3BgvZLuzZyCi35upqrM
 711ei7CKaN0s/jKcWclNRtgUrr10XsgQ6PugOZbli09CC8ushHwvXe/VmxoQFg2+
 Sj14YSfYAY+1QpOiuYc+knrWc7CtPGHgUqBzOoYWMxi9Lqpo5xhD1vkRsFhXxMSg
 GVnAnh/RXl7bGzGWaRv8twG4vU+qYOlEPNgO6/079AxCOrrNrstYrgjBxBSWuxrB
 0UIFQGT69zA5G3cLbIRrXUxO8oIVeEx92YV1TOcgLKP5OXlp/0I8ajnA9b8KoPZa
 He04GdPlZMXTLAqq9MaQRdS0XzX8YQDWbUebqe+w5NW46sLbckkmxaNZs7fOYAfG
 CNHfeRsLp5v0oNbhNyCDSjxqH6uYwKCdCqmDxo6A+fmjmDruHQmZoAK8YISUtPtx
 u4M82jW6Z/xOg4pomxMl4SxzCDhy1pM8PYzyx7Mj82C4XBR8CkrQTP8XD+FQL2Ih
 KhId4tJzx6Q=
 =Rycs
 -----END PGP SIGNATURE-----

Merge tag 'md-3.9-fixes' of git://neil.brown.name/md

Pull md fixes from NeilBrown:
 "A few bugfixes for md

   - recent regressions in raid5
   - recent regressions in dmraid
   - a few instances of CONFIG_MULTICORE_RAID456 linger

  Several tagged for -stable"

* tag 'md-3.9-fixes' of git://neil.brown.name/md:
  md: remove CONFIG_MULTICORE_RAID456 entirely
  md/raid5: ensure sync and DISCARD don't happen at the same time.
  MD: Prevent sysfs operations on uninitialized kobjects
  MD RAID5: Avoid accessing gendisk or queue structs when not available
  md/raid5: schedule_construction should abort if nothing to do.
2013-03-23 15:49:49 -07:00
Mike Snitzer ea2dd8c1ed dm cache: policy ignore hints if generated by different version
When reading the dm cache metadata from disk, ignore the policy hints
unless they were generated by the same major version number of the same
policy module.

The hints are considered to be private data belonging to the specific
module that generated them and there is no requirement for them to make
sense to different versions of the policy that generated them.
Policy modules are all required to work fine if no previous hints are
supplied (or if existing hints are lost).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20 17:21:28 +00:00
Mike Snitzer 4e7f506f64 dm cache: policy change version from string to integer set
Separate dm cache policy version string into 3 unsigned numbers
corresponding to major, minor and patchlevel and store them at the end
of the on-disk metadata so we know which version of the policy generated
the hints in case a future version wants to use them differently.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20 17:21:27 +00:00
Joe Thornber e2e74d617e dm cache: fix race in writethrough implementation
We have found a race in the optimisation used in the dm cache
writethrough implementation.  Currently, dm core sends the cache target
two bios, one for the origin device and one for the cache device and
these are processed in parallel.  This patch avoids the race by
changing the code back to a simpler (slower) implementation which
processes the two writes in series, one after the other, until we can
develop a complete fix for the problem.

When the cache is in writethrough mode it needs to send WRITE bios to
both the origin and cache devices.

Previously we've been implementing this by having dm core query the
cache target on every write to find out how many copies of the bio it
wants.  The cache will ask for two bios if the block is in the cache,
and one otherwise.

Then main problem with this is it's racey.  At the time this check is
made the bio hasn't yet been submitted and so isn't being taken into
account when quiescing a block for migration (promotion or demotion).
This means a single bio may be submitted when two were needed because
the block has since been promoted to the cache (catastrophic), or two
bios where only one is needed (harmless).

I really don't want to start entering bios into the quiescing system
(deferred_set) in the get_num_write_bios callback.  Instead this patch
simplifies things; only one bio is submitted by the core, this is
first written to the origin and then the cache device in series.
Obviously this will have a latency impact.

deferred_writethrough_bios is introduced to record bios that must be
later issued to the cache device from the worker thread.  This deferred
submission, after the origin bio completes, is required given that we're
in interrupt context (writethrough_endio).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20 17:21:27 +00:00
Joe Thornber 79ed9caffc dm cache: metadata clear dirty bits on clean shutdown
When writing the dirty bitset to the metadata device on a clean
shutdown, clear the dirty bits.  Previously they were left indicating
the cache was dirty. This led to confusion about whether there really
was dirty data in the cache or not.  (This was a harmless bug.)

Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20 17:21:27 +00:00
Heinz Mauelshagen b978440b8d dm cache: avoid calling policy destructor twice on error
If the cache policy's config values are not able to be set we must
set the policy to NULL after destroying it in create_cache_policy()
so we don't attempt to destroy it a second time later.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20 17:21:26 +00:00
Heinz Mauelshagen 617a0b89da dm cache: detect cache_create failure
Return error if cache_create() fails.

A missing return check made cache_ctr continue even after an error in
cache_create() resulting in the cache object being destroyed.  So a
simple failure like an odd number of cache policy config value arguments
would result in an oops.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20 17:21:26 +00:00
Joe Thornber 414dd67d50 dm cache: avoid 64 bit division on 32 bit
Squash various 32bit link errors.

  >> on i386:
  >> drivers/built-in.o: In function `is_discarded_oblock':
  >> dm-cache-target.c:(.text+0x1ea28e): undefined reference to `__udivdi3'
  ...

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20 17:21:25 +00:00
Mikulas Patocka 3b6b7813b1 dm verity: avoid deadlock
A deadlock was found in the prefetch code in the dm verity map
function.  This patch fixes this by transferring the prefetch
to a worker thread and skipping it completely if kmalloc fails.

If generic_make_request is called recursively, it queues the I/O
request on the current->bio_list without making the I/O request
and returns. The routine making the recursive call cannot wait
for the I/O to complete.

The deadlock occurs when one thread grabs the bufio_client
mutex and waits for an I/O to complete but the I/O is queued
on another thread's current->bio_list and is waiting to get
the mutex held by the first thread.

The fix recognises that prefetching is not essential.  If memory
can be allocated, it queues the prefetch request to the worker thread,
but if not, it does nothing.

Signed-off-by: Paul Taysom <taysom@chromium.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org
2013-03-20 17:21:25 +00:00
Joe Thornber 58051b94e0 dm thin: fix non power of two discard granularity calc
Fix a discard granularity calculation to work for non power of 2 block sizes.

In order for thinp to passdown discard bios to the underlying data
device, the data device must have a discard granularity that is a
factor of the thinp block size.  Originally this check was done by
using bitops since the block_size was known to be a power of two.

Introduced by commit f13945d757
("dm thin: support a non power of 2 discard_granularity").

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20 17:21:25 +00:00
Joe Thornber f046f89a99 dm thin: fix discard corruption
Fix a bug in dm_btree_remove that could leave leaf values with incorrect
reference counts.  The effect of this was that removal of a shared block
could result in the space maps thinking the block was no longer used.
More concretely, if you have a thin device and a snapshot of it, sending
a discard to a shared region of the thin could corrupt the snapshot.

Thinp uses a 2-level nested btree to store it's mappings.  This first
level is indexed by thin device, and the second level by logical
block.

Often when we're removing an entry in this mapping tree we need to
rebalance nodes, which can involve shadowing them, possibly creating a
copy if the block is shared.  If we do create a copy then children of
that node need to have their reference counts incremented.  In this
way reference counts percolate down the tree as shared trees diverge.

The rebalance functions were incrementing the children at the
appropriate time, but they were always assuming the children were
internal nodes.  This meant the leaf values (in our case packed
block/flags entries) were not being incremented.

Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20 17:21:24 +00:00
Paul Bolle 238f5908bd md: remove CONFIG_MULTICORE_RAID456 entirely
Once instance of this Kconfig macro remained after commit
51acbcec6c ("md: remove
CONFIG_MULTICORE_RAID456"). Remove that one too. And, while we're at it,
also remove it from the defconfig files that carry it.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-03-20 13:21:14 +11:00
NeilBrown f8dfcffd04 md/raid5: ensure sync and DISCARD don't happen at the same time.
A number of problems can occur due to races between
resync/recovery and discard.

- if sync_request calls handle_stripe() while a discard is
  happening on the stripe, it might call handle_stripe_clean_event
  before all of the individual discard requests have completed
  (so some devices are still locked, but not all).
  Since commit ca64cae960
     md/raid5: Make sure we clear R5_Discard when discard is finished.
  this will cause R5_Discard to be cleared for the parity device,
  so handle_stripe_clean_event() will not be called when the other
  devices do become unlocked, so their ->written will not be cleared.
  This ultimately leads to a WARN_ON in init_stripe and a lock-up.

- If handle_stripe_clean_event() does clear R5_UPTODATE at an awkward
  time for resync, it can lead to s->uptodate being less than disks
  in handle_parity_checks5(), which triggers a BUG (because it is
  one).

So:
 - keep R5_Discard on the parity device until all other devices have
   completed their discard request
 - make sure we don't try to have a 'discard' and a 'sync' action at
   the same time.
   This involves a new stripe flag to we know when a 'discard' is
   happening, and the use of R5_Overlap on the parity disk so when a
   discard is wanted while a sync is active, so we know to wake up
   the discard at the appropriate time.

Discard support for RAID5 was added in 3.7, so this is suitable for
any -stable kernel since 3.7.

Cc: stable@vger.kernel.org (v3.7+)
Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Tested-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-03-20 13:20:59 +11:00
Jonathan Brassow 90584fc93d MD: Prevent sysfs operations on uninitialized kobjects
MD: Prevent sysfs operations on uninitialized kobjects

Device-mapper does not use sysfs; but when device-mapper is leveraging
MD's RAID personalities, MD sometimes attempts to update sysfs.  This
patch adds checks for 'mddev-kobj.sd' in sysfs_[un]link_rdev to ensure
it is about to operate on something valid.  This patch also checks for
'mddev->kobj.sd' before calling 'sysfs_notify' in 'remove_and_add_spares'.
Although 'sysfs_notify' already makes this check, doing so in
'remove_and_add_spares' prevents an additional mutex operation.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-03-20 13:17:57 +11:00
Jonathan Brassow e3620a3ad5 MD RAID5: Avoid accessing gendisk or queue structs when not available
MD RAID5:  Fix kernel oops when RAID4/5/6 is used via device-mapper

Commit a9add5d (v3.8-rc1) added blktrace calls to the RAID4/5/6 driver.
However, when device-mapper is used to create RAID4/5/6 arrays, the
mddev->gendisk and mddev->queue fields are not setup.  Therefore, calling
things like trace_block_bio_remap will cause a kernel oops.  This patch
conditionalizes those calls on whether the proper fields exist to make
the calls.  (Device-mapper will call trace_block_bio_remap on its own.)

This patch is suitable for the 3.8.y stable kernel.

Cc: stable@vger.kernel.org (v3.8+)
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-03-20 13:16:57 +11:00
NeilBrown ce7d363aaf md/raid5: schedule_construction should abort if nothing to do.
Since commit 1ed850f356
    md/raid5: make sure to_read and to_write never go negative.

It has been possible for handle_stripe_dirtying to be called
when there isn't actually any work to do.
It then calls schedule_reconstruction() which will set R5_LOCKED
on the parity block(s) even when nothing else is happening.
This then causes problems in do_release_stripe().

So add checks to schedule_reconstruction() so that if it doesn't
find anything to do, it just aborts.

This bug was introduced in v3.7, so the patch is suitable
for -stable kernels since then.

Cc: stable@vger.kernel.org (v3.7+)
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-03-20 12:16:51 +11:00
Linus Torvalds a5e0d73163 md updates for 3.9
mostly little bugfixes.
 Only "feature" is a new RAID10 layout which slightly
 improves the number of sets of devices that can concurrently
 fail, without data loss.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIVAwUAUTPm+znsnt1WYoG5AQLLsw/+PMqr8roC4twgxTWV1NRbU8NtOcRi9Rj9
 uvBS63uYAaLdi/D3UBKFYczmNCu9knuXbcp9SgFDxH7LlthQsWN/GYnif06pPo3w
 9Agu5M8c062TJEG1vrnX6FhPO6pNgrWFr3h+CKkTiD3179i9DoQpP8LXQToeyMtI
 YRMQf/zCkxYtDvWAP0iwsEWtw8cf+q9I/uGPhQ1L+DnZapXYdbtnqWBRz9q6mrDt
 orcGrP41aZHvnOHUaTbwmaorCKkf/Ys4SMaGenrSFpnpQMypt7VgNuwHC59LxvJT
 5eiFG/26zIsv7Wk0jv/TvFP5qzUPo0/PFkd5ug0ArvbVRiXS2cMJDwQvMdO1toxD
 i5Bb+P9DptadvoWhOTgIpxnG77yRH45wJvyJOk+ZfS1/IO87nCRa3d0yiNOU5e2/
 o0VdXPZRr72sdKKTK6kQuYfwCPb+Z2Pz6Q8BJdk6GxlmTXyP6sKhIgwUX86534fE
 LrOxfK8qV+GetVu3X02RoX2CyJJRQHXyXmbHuSzXuo/JiOYtDigAydwNZChvf+tf
 OoMY9K8vgNbhnGsUG6la7XPvZ+6dZMjdnxp2HB99Ml5A3PWZd75i5T6IHHxIQFbD
 C3z9PWTWP+hK4k15DEyjlELtsE9WduGTXG4kUcf328xJ/7lj4VIImVugdCz+1B6z
 +HlI6BiLwzY=
 =YdVD
 -----END PGP SIGNATURE-----

Merge tag 'md-3.9' of git://neil.brown.name/md

Pull md updates from NeilBrown:
 "Mostly little bugfixes.

  Only "feature" is a new RAID10 layout which slightly improves the
  number of sets of devices that can concurrently fail, without data
  loss."

* tag 'md-3.9' of git://neil.brown.name/md:
  md: expedite metadata update when switching  read-auto -> active
  md: remove CONFIG_MULTICORE_RAID456
  md/raid1,raid10: fix deadlock with freeze_array()
  md/raid0: improve error message when converting RAID4-with-spares to RAID0
  md: raid0: fix error return from create_stripe_zones.
  md: fix two bugs when attempting to resize RAID0 array.
  DM RAID: Add support for MD's RAID10 "far" and "offset" algorithms
  MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 2)
  MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 1)
  MD RAID10: Minor non-functional code changes
  md: raid1,10: Handle REQ_WRITE_SAME flag in write bios
  md: protect against crash upon fsync on ro array
2013-03-05 17:22:08 -08:00
Heinz Mauelshagen 8735a81347 dm cache: add cleaner policy
A simple cache policy that writes back all data to the origin.

This is used to decommission a dm cache by emptying it.

Signed-off-by: Heinz Mauelshagen <mauelshagen@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:52 +00:00
Joe Thornber f283635281 dm cache: add mq policy
A cache policy that uses a multiqueue ordered by recent hit
count to select which blocks should be promoted and demoted.
This is meant to be a general purpose policy.  It prioritises
reads over writes.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:51 +00:00
Joe Thornber c6b4fcbad0 dm: add cache target
Add a target that allows a fast device such as an SSD to be used as a
cache for a slower device such as a disk.

A plug-in architecture was chosen so that the decisions about which data
to migrate and when are delegated to interchangeable tunable policy
modules.  The first general purpose module we have developed, called
"mq" (multiqueue), follows in the next patch.  Other modules are
under development.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Heinz Mauelshagen <mauelshagen@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:51 +00:00
Joe Thornber 7a87edfee7 dm persistent data: add bitset
Add a persistent bitset as a wrapper around dm-array.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:51 +00:00
Joe Thornber 6513c29f44 dm persistent data: add transactional array
Add a transactional array.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:51 +00:00
Joe Thornber 025b96853f dm thin: remove cells from stack
This patch takes advantage of the new bio-prison interface where the
memory is now passed in rather than using a mempool in bio-prison.
This allows the map function to avoid performing potentially-blocking
allocations that could lead to deadlocks: We want to avoid the cell
allocation that is done in bio_detain.

(The potential for mempool deadlocks still remains in other functions
that use bio_detain.)

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:50 +00:00
Joe Thornber 6beca5eb6e dm bio prison: pass cell memory in
Change the dm_bio_prison interface so that instead of allocating memory
internally, dm_bio_detain is supplied with a pre-allocated cell each
time it is called.

This enables a subsequent patch to move the allocation of the struct
dm_bio_prison_cell outside the thin target's mapping function so it can
no longer block there.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:50 +00:00
Joe Thornber 4e7f1f9089 dm persistent data: add btree_walk
Add dm_btree_walk to iterate through the contents of a btree.
This will be used by the dm cache target.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:50 +00:00
Alasdair G Kergon b0d8ed4d96 dm: add target num_write_bios fn
Add a num_write_bios function to struct target.

If an instance of a target sets this, it will be queried before the
target's mapping function is called on a write bio, and the response
controls the number of copies of the write bio that the target will
receive.

This provides a convenient way for a target to send the same data to
more than one device.  The new cache target uses this in writethrough
mode, to send the data both to the cache and the backing device.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:49 +00:00
Mikulas Patocka df5d2e9089 dm kcopyd: introduce configurable throttling
This patch allows the administrator to reduce the rate at which kcopyd
issues I/O.

Each module that uses kcopyd acquires a throttle parameter that can be
set in /sys/module/*/parameters.

We maintain a history of kcopyd usage by each module in the variables
io_period and total_period in struct dm_kcopyd_throttle. The actual
kcopyd activity is calculated as a percentage of time equal to
"(100 * io_period / total_period)".  This is compared with the user-defined
throttle percentage threshold and if it is exceeded, we sleep.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:49 +00:00
Mikulas Patocka a26062416e dm ioctl: allow message to return data
This patch introduces enhanced message support that allows the
device-mapper core to recognise messages that are common to all devices,
and for messages to return data to userspace.

Core messages are processed by the function "message_for_md".  If the
device mapper doesn't support the message, it is passed to the target
driver.

If the message returns data, the kernel sets the flag
DM_MESSAGE_OUT_FLAG.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:49 +00:00
Mikulas Patocka 02cde50b7e dm ioctl: optimize functions without variable params
Device-mapper ioctls receive and send data in a buffer supplied
by userspace.  The buffer has two parts.  The first part contains
a 'struct dm_ioctl' and has a fixed size.  The second part depends
on the ioctl and has a variable size.

This patch recognises the specific ioctls that do not use the variable
part of the buffer and skips allocating memory for it.

In particular, when a device is suspended and a resume ioctl is sent,
this now avoid memory allocation completely.

The variable "struct dm_ioctl tmp" is moved from the function
copy_params to its caller ctl_ioctl and renamed to param_kernel.
It is used directly when the ioctl function doesn't need any arguments.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:49 +00:00
Mikulas Patocka e2914cc26b dm ioctl: introduce ioctl_flags
This patch introduces flags for each ioctl function.

So far, one flag is defined, IOCTL_FLAGS_NO_PARAMS.  It is set if the
function processing the ioctl doesn't take or produce any parameters in
the section of the data buffer that has a variable size.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:48 +00:00
Jun'ichi Nomura 5f01520415 dm: merge io_pool and tio_pool
This patch merges io_pool and tio_pool into io_pool and cleans up
related functions.

Though device-mapper used to have 2 pools of objects for each dm device,
the use of bioset frontbad for per-bio data has shrunk the number of
pools to 1 for both bio-based and request-based device types.
(See c0820cf5 "dm: introduce per_bio_data" and
 94818742 "dm: Use bioset's front_pad for dm_rq_clone_bio_info")

So dm no longer has to maintain 2 different pointers.

No functional changes.

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:48 +00:00
Jun'ichi Nomura 23e5083b4d dm: remove unused _rq_bio_info_cache
Remove _rq_bio_info_cache, which is no longer used.
No functional changes.

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:48 +00:00
Mike Christie 87eb5b21d9 dm: fix limits initialization when there are no data devices
dm_calculate_queue_limits will first reset the provided limits to
defaults using blk_set_stacking_limits; whereby defeating the purpose of
retaining the original live table's limits -- as was intended via commit
3ae7065616 ("dm: retain table limits when
swapping to new table with no devices").

Fix this improper limits initialization (in the no data devices case) by
avoiding the call to dm_calculate_queue_limits.

[patch header revised by Mike Snitzer]

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.6+
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01 22:45:48 +00:00