Commit Graph

2606 Commits

Author SHA1 Message Date
NeilBrown deb200d085 md/raid10: split out interpretation of layout to separate function.
We will soon be interpreting the layout (and chunksize etc) from
multiple places to support reshape.  So split it out into separate
function.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:28:33 +10:00
NeilBrown f8c9e74ff0 md/raid10: Introduce 'prev' geometry to support reshape.
When RAID10 supports reshape it will need a 'previous' and a 'current'
geometry, so introduce that here.
Use the 'prev' geometry when before the reshape_position, and the
current 'geo' when beyond it.  At other times, use both as
appropriate.

For now, both are identical (And reshape_position is never set).

When we use the 'prev' geometry, we must use the old data_offset.
When we use the current (And a reshape is happening) we must use
the new_data_offset.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:28:33 +10:00
NeilBrown c804cdecea md: use resync_max_sectors for reshape as well as resync.
Some resync type operations need to act on the address space of the
device, others on the address space of the array.

This only affects RAID10, so it sets resync_max_sectors to the array
size (it defaults to the device size), and that is currently used for
resync only.  However reshape of a RAID10 must be done against the
array size, not device size, so change code to use resync_max_sectors
for both the resync and the reshape cases.
This does not affect RAID5 or RAID1, just RAID10.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:28:33 +10:00
NeilBrown 1fdd6fc92f md: teach sync_page_io about new_data_offset.
Some code in raid1 and raid10 use sync_page_io to
read/write pages when responding to read errors.
As we will shortly support changing data_offset for
raid10, this function must understand new_data_offset.

So add that understanding.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:28:32 +10:00
NeilBrown 5cf00fcd3c md/raid10: collect some geometry fields into a dedicated structure.
We will shortly be adding reshape support for RAID10 which will
require it having 2 concurrent geometries (before and after).
To make that easier, collect most geometry fields into 'struct geom'
and access them from there.  Then we will more easily be able to add
a second set of fields.

Note that 'copies' is not in this struct and so cannot be changed.
There is little need to change this number and doing so is a lot
more difficult as it requires reallocating more things.
So leave it out for now.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:28:20 +10:00
NeilBrown b5254dd5fd md/raid5: allow for change in data_offset while managing a reshape.
The important issue here is incorporating the different in data_offset
into calculations concerning when we might need to over-write data
that is still thought to be valid.

To this end we find the minimum offset difference across all devices
and add that where appropriate.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:27:01 +10:00
NeilBrown 05616be5e1 md/raid5: Use correct data_offset for all IO.
As there can now be two different data_offsets - an 'old' and
a 'new' - we need to carefully choose between them.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:27:00 +10:00
NeilBrown c6563a8c38 md: add possibility to change data-offset for devices.
When reshaping we can avoid costly intermediate backup by
changing the 'start' address of the array on the device
(if there is enough room).

So as a first step, allow such a change to be requested
through sysfs, and recorded in v1.x metadata.

(As we didn't previous check that all 'pad' fields were zero,
 we need a new FEATURE flag for this.
 A (belatedly) check that all remaining 'pad' fields are
 zero to avoid a repeat of this)

The new data offset must be requested separately for each device.
This allows each to have a different change in the data offset.
This is not likely to be used often but as data_offset can be
set per-device, new_data_offset should be too.

This patch also removes the 'acknowledged' arg to rdev_set_badblocks as
it is never used and never will be.  At the same time we add a new
arg ('in_new') which is currently always zero but will be used more
soon.

When a reshape finishes we will need to update the data_offset
and rdev->sectors.  So provide an exported function to do that.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:27:00 +10:00
NeilBrown 2c810cddc4 md: allow a reshape operation to be reversed.
Currently a reshape operation always progresses from the start
of the array to the end unless the number of devices is being
reduced, in which case it progressed in the opposite direction.

To reverse a partial reshape which changes the number of devices
you can stop the array and re-assemble with the raid-disks numbers
reversed and it will undo.

However for a reshape that does not change the number of devices
it is not possible to reverse the reshape in the middle - you have to
wait until it completes.

So add a 'reshape_direction' attribute with is either 'forwards' or
'backwards' and can be explicitly set when delta_disks is zero.

This will become more important when we allow the data_offset to
change in a reshape.  Then the explicit statement of what direction is
being used will be more useful.

This can be enabled in raid5 trivially as it already supports
reverse reshape and just needs to use a different trigger to request it.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:27:00 +10:00
Shaohua Li b5e1b8cee7 md: using GFP_NOIO to allocate bio for flush request
A flush request is usually issued in transaction commit code path, so
using GFP_KERNEL to allocate memory for flush request bio falls into
the classic deadlock issue.

This is suitable for any -stable kernel to which it applies as it
avoids a possible deadlock.

Cc: stable@vger.kernel.org
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:26:59 +10:00
Linus Torvalds b1dab2f040 A fix to the thin provisioning userspace interface.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJPtuPVAAoJEK2W1qbAHj1n3q8QAJaM5OkZ3DqAoC8XznVBKO1B
 Fr5d0mui4irC9ce+End/MCoPN2EcwQ5bzyh6cQXqxfUOTR5RjYMCn97uID0I35Cc
 AEbDaUMSgKkQkKaDeVpM54SaHBhtLP95gIZmo774etmryzn3HyWiiYOsvswnAw8w
 /zIZwPkRPoskoGt1Hk/o490qYGwstJ6JErJndyuVEdHNWF6xiL9cFAi81oo1UF2J
 lysWmBBaisRcAD3kzbTBIQ3ntMUPSbvM7ZLUaJXEbXwg+XUBRkBfQLXJk3wA5q8R
 QacTz+kvAXG6B+DipFOYyDTkxR+s5kjRserlBwh2gHP9u3pHV9Nf7eWR5BWsYPbV
 tLH8Q/0Ctl31MsrxMl7cO05Kj2sMI/lOg/yGQT3Qx6k0HVtbU1gaR6J21RmpElGC
 TANXpBhV5MRpfZFvwnHOSACnE3hMm1i4XJstTejIfPiNtq8aeL0eSB3Q+ZJTEZzm
 ZCk23ufmaqC75RrYt1P3F5QpxsjD+moMmZT5hZSkGOd0YqvwAMfv9O/5FesQ748R
 tc5zPxS/dTONseGkBaAPeHpcX4WP0wm0fKJcPb0Y/4kx0AkiVQjbUrNNcnVxo5Pw
 pF1MbxY14xZxteZ/UHWmPvyCuJ8NICTtswxj9taquSDb87pliV2xajrK4Mnmed7u
 sLxA0ANWBdJghss/r/g6
 =03zw
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.4-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm

Pull a dm fix from Alasdair G Kergon:
 "A fix to the thin provisioning userspace interface."

* tag 'dm-3.4-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
  dm thin: fix table output when pool target disables discard passdown internally
2012-05-18 18:22:45 -07:00
Mike Snitzer f402693d06 dm thin: fix table output when pool target disables discard passdown internally
When the thin pool target clears the discard_passdown parameter
internally, it incorrectly changes the table line reported to userspace.
This breaks dumb string comparisons on these table lines in generic
userspace device-mapper library code and leads to tables being reloaded
repeatedly when nothing is actually meant to be changing.

This patch corrects this by no longer changing the table line when
discard passdown was disabled.

We can still tell when discard passdown is overridden by looking for the
message "Discard unsupported by data device (sdX): Disabling discard passdown."

This automatic detection is also moved from the 'load' to the 'resume'
so that it is re-evaluated should the properties of underlying devices
change.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-05-19 01:01:01 +01:00
Linus Torvalds 2f05af8b59 Fix bug in recent fix to RAID10.
Without this patch, recovery will crash
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIVAwUAT7bVGznsnt1WYoG5AQL0ohAAvpF47NkSnuqw9gko25/Ndr5akrIC9kfz
 YNzxsd0FvNSEQipGS4KgBcRxnFxbcVkmHL+pN8McmiDxNew1LW2+zdtV47RJngnQ
 2slTUiSBLSHcvnOFblBrwIh+XpMdhv4FEMG13Si4tQaETtHp3JiNDr2JqvwbuKbo
 aO/nW38DFNjEcB59X+9npknZYaymvavxso4J7iH/ec0Iys2c2h+jX5afewCXnbYr
 Q6X2YlySUiUM3AXbgV0QI/w6xDM0TD+WWguHq411pgU1wHMYpHelYYwRGHSU73TJ
 061K5WB83Q53A8czeRXK33x0xGr/AhQesTel+mlV3eODHoGSnsYxH8Zja7sV220h
 HgTICrSvAFL2cvBz8dqJQ59KYH3yNo4Cie5xzmBxCuX6yyTxizgrQTp7L/HUy7RL
 eL1SqNFTRJcwSl3MwmgbL57saBLw+/ejV+og7PNBCXSS9gZg39uJyn1uTqJZTrcL
 PuBaizNG9HCYuR/pRTMsQGNWrksqk9r+hxwCJDIGb7xh9SBFI82YtQfK9F5fSB84
 vdmIec8yUvwS/Gxhz+I+YB3kqk/nwfFRbVGfBWvJdRBOR0DsZphsBapsJ83LZqFb
 VAa2nQL+NoHOweixv6RgiH4Y96SNHdWkvDJjj7R01BUDdpHeHMZuff07Effgqd64
 Knk64XvWKQc=
 =s7vy
 -----END PGP SIGNATURE-----

Merge tag 'md-3.4-fixes' of git://neil.brown.name/md

Pull one more md bugfix from NeilBrown:
 "Fix bug in recent fix to RAID10.

  Without this patch, recovery will crash"

* tag 'md-3.4-fixes' of git://neil.brown.name/md:
  md/raid10: fix transcription error in calc_sectors conversion.
2012-05-18 16:19:59 -07:00
NeilBrown b0d634d568 md/raid10: fix transcription error in calc_sectors conversion.
The old code was
		sector_div(stride, fc);
the new code was
		sector_dir(size, conf->near_copies);

'size' is right (the stride various wasn't really needed), but
'fc' means 'far_copies', and that is an important difference.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-19 09:01:13 +10:00
Linus Torvalds 36a1987cd8 md: 2 fixes for 3.4
one fixes a bug in the new raid10 resize code so is relevant
 to 3.4 only
 Other fixes a bug in the use of md by dm-raid, so is relevant
 to any kernel with dm-raid support
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIVAwUAT7SBkznsnt1WYoG5AQK3wQ//Q2sPicPHb5MNGTTBpphYo1QWo+l9jFHs
 ZDBM+MaiNJg3kBN5ueUU+MENvLcaA5+zoxsGVBXBKyXr70ffqiQcLXyU7fHwrGu3
 5MD36p55ZPnq2pemCrp4qdTXEUabmDb+0/R7e5lywnzNdbmCAfh4uYih0VPiaClV
 ihq/Ci12TDnezmLjksc09OCquhm0s3zH2BnMCVdmSAkhnXCxTeZ45s/ob71Y2xvj
 cJ15SYlAG4t0QCikL5R8pZtkh0h2SuUhufDE09eD8yT4RGO4PHSQ4oHujajftzey
 9sB0NGH7Yla8gOXjA+EpzKPaiqtZxJB+1v/bhqA2FoOYAks8VoFfeqgwUbPYE7bk
 GIfGB4hFsUXaJo13uzofyJXBIp9mM/J5Sk1VJsiLE85P7wewg6N199B8lpC3lFDw
 tMLjfTMJzFOUqZBESjJoxyrc4fairZ9VCUWwpqjuioLO50e+lOi/jQHTspX78e+w
 GxgjHp8hh0RqQiTkl7vIz9KVcQIeOTG9uzz61IuDp15cRSrMs6E8gVKoX8gKW9g2
 Hec17fdG/H6ZeZa7MB9GzUD4HCj0PRbODQ3/fPhUdsbgtQjOvsVUH8LCRRU0U6cb
 YF+qsDFtUF7QT2kNbrs9R6adGj97c2HWUMyRWMQAXGuL5TkstvhrRv/rk1+bv2VG
 w7ptbiklj7o=
 =9zxe
 -----END PGP SIGNATURE-----

Merge tag 'md-3.4-fixes' of git://neil.brown.name/md

Pull two md fixes from NeilBrown:
 "One fixes a bug in the new raid10 resize code so is relevant to 3.4
  only.

  The other fixes a bug in the use of md by dm-raid, so is relevant to
  any kernel with dm-raid support"

* tag 'md-3.4-fixes' of git://neil.brown.name/md:
  MD: Add del_timer_sync to mddev_suspend (fix nasty panic)
  md/raid10: set dev_sectors properly when resizing devices in array.
2012-05-17 09:44:35 -07:00
Jonathan Brassow 0d9f4f135e MD: Add del_timer_sync to mddev_suspend (fix nasty panic)
Use del_timer_sync to remove timer before mddev_suspend finishes.

We don't want a timer going off after an mddev_suspend is called.  This is
especially true with device-mapper, since it can call the destructor function
immediately following a suspend.  This results in the removal (kfree) of the
structures upon which the timer depends - resulting in a very ugly panic.
Therefore, we add a del_timer_sync to mddev_suspend to prevent this.

Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-17 10:38:24 +10:00
NeilBrown 6508fdbf40 md/raid10: set dev_sectors properly when resizing devices in array.
raid10 stores dev_sectors in 'conf' separately from the one in
'mddev' because it can have a very significant effect on block
addressing and so need to be updated carefully.

However raid10_resize isn't updating it at all!

To update it correctly, we need to make sure it is a proper
multiple of the chunksize taking various details of the layout
in to account.
This calculation is currently done in setup_conf.   So split it
out from there and call it from raid10_resize as well.
Then set conf->dev_sectors properly.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-17 10:08:45 +10:00
Linus Torvalds 4a873f5399 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David S. Miller:

 1) Since we do RCU lookups on ipv4 FIB entries, we have to test if the
    entry is dead before returning it to our caller.

 2) openvswitch locking and packet validation fixes from Ansis Atteka,
    Jesse Gross, and Pravin B Shelar.

 3) Fix PM resume locking in IGB driver, from Benjamin Poirier.

 4) Fix VLAN header handling in vhost-net and macvtap, from Basil Gor.

 5) Revert a bogus network namespace isolation change that was causing
    regressions on S390 networking devices.

 6) If bonding decides to process and handle a LACPDU frame, we
    shouldn't bump the rx_dropped counter.  From Jiri Bohac.

 7) Fix mis-calculation of available TX space in r8169 driver when doing
    TSO, which can lead to crashes and/or hung device.  From Julien
    Ducourthial.

 8) SCTP does not validate cached routes properly in all cases, from
    Nicolas Dichtel.

 9) Link status interrupt needs to be handled in ks8851 driver, from
    Stephen Boyd.

10) Use capable(), not cap_raised(), in connector/userns netlink code.
    From Eric W. Biederman via Andrew Morton.

11) Fix pktgen OOPS on module unload, from Eric Dumazet.

12) iwlwifi under-estimates SKB truesizes, also from Eric Dumazet.

13) Cure division by zero in SFC driver, from Ben Hutchings.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
  ks8851: Update link status during link change interrupt
  macvtap: restore vlan header on user read
  vhost-net: fix handle_rx buffer size
  bonding: don't increase rx_dropped after processing LACPDUs
  connector/userns: replace netlink uses of cap_raised() with capable()
  sctp: check cached dst before using it
  pktgen: fix crash at module unload
  Revert "net: maintain namespace isolation between vlan and real device"
  ehea: fix losing of NEQ events when one event occurred early
  igb: fix rtnl race in PM resume path
  ipv4: Do not use dead fib_info entries.
  r8169: fix unsigned int wraparound with TSO
  sfc: Fix division by zero when using one RX channel and no SR-IOV
  openvswitch: Validation of IPv6 set port action uses IPv4 header
  net: compare_ether_addr[_64bits]() has no ordering
  cdc_ether: Ignore bogus union descriptor for RNDIS devices
  bnx2x: bug fix when loading after SAN boot
  e1000: Silence sparse warnings by correcting type
  igb, ixgbe: netdev_tx_reset_queue incorrectly called from tx init path
  openvswitch: Release rtnl_lock if ovs_vport_cmd_build_info() failed.
  ...
2012-05-12 12:57:01 -07:00
Mike Snitzer 510193a2d3 dm mpath: check if scsi_dh module already loaded before trying to load
If the requested scsi_dh module is already loaded then skip
request_module().

Multipath table loads can hang in an unnecessary __request_module.

Reported-by: Ben Marzinski <bmarzins@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-05-12 01:43:21 +01:00
Alasdair G Kergon 7cab8bf160 dm thin: correct module description
Remove duplicate copy of string "device-mapper" (DM_NAME) from
MODULE_DESCRIPTION.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-05-12 01:43:19 +01:00
Mike Snitzer c3a0ce2eab dm thin: fix unprotected use of prepared_discards list
Fix two places in commit 104655fd4d ("dm thin: support discards") that
didn't use pool->lock to protect against concurrent changes to the
prepared_discards list.

Without this fix, thin_endio() can race with process_discard(), leading
to concurrent list_add()s that result in the processes locking up with
an error like the following:

WARNING: at lib/list_debug.c:32 __list_add+0x8f/0xa0()
...
list_add corruption. next->prev should be prev (ffff880323b96140), but was ffff8801d2c48440. (next=ffff8801d2c485c0).
...
Pid: 17205, comm: kworker/u:1 Tainted: G        W  O 3.4.0-rc3.snitm+ #1
Call Trace:
 [<ffffffff8103ca1f>] warn_slowpath_common+0x7f/0xc0
 [<ffffffff8103cb16>] warn_slowpath_fmt+0x46/0x50
 [<ffffffffa04f6ce6>] ? bio_detain+0xc6/0x210 [dm_thin_pool]
 [<ffffffff8124ff3f>] __list_add+0x8f/0xa0
 [<ffffffffa04f70d2>] process_discard+0x2a2/0x2d0 [dm_thin_pool]
 [<ffffffffa04f6a78>] ? remap_and_issue+0x38/0x50 [dm_thin_pool]
 [<ffffffffa04f7c3b>] process_deferred_bios+0x7b/0x230 [dm_thin_pool]
 [<ffffffffa04f7df0>] ? process_deferred_bios+0x230/0x230 [dm_thin_pool]
 [<ffffffffa04f7e42>] do_worker+0x52/0x60 [dm_thin_pool]
 [<ffffffff81056fa9>] process_one_work+0x129/0x450
 [<ffffffff81059b9c>] worker_thread+0x17c/0x3c0
 [<ffffffff81059a20>] ? manage_workers+0x120/0x120
 [<ffffffff8105eabe>] kthread+0x9e/0xb0
 [<ffffffff814ceda4>] kernel_thread_helper+0x4/0x10
 [<ffffffff8105ea20>] ? kthread_freezable_should_stop+0x70/0x70
 [<ffffffff814ceda0>] ? gs_change+0x13/0x13
---[ end trace 7e0a523bc5e52692 ]---

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-05-12 01:43:16 +01:00
Mike Snitzer 03aaae7cdc dm thin: reinstate missing mempool_free in cell_release_singleton
Fix a significant memory leak inadvertently introduced during
simplification of cell_release_singleton() in commit
6f94a4c45a ("dm thin: fix stacked bi_next
usage").

A cell's hlist_del() must be accompanied by a mempool_free().
Use __cell_release() to do this, like before.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-05-12 01:43:12 +01:00
Eric W. Biederman 38bf195398 connector/userns: replace netlink uses of cap_raised() with capable()
In 2009 Philip Reiser notied that a few users of netlink connector
interface needed a capability check and added the idiom
cap_raised(nsp->eff_cap, CAP_SYS_ADMIN) to a few of them, on the premise
that netlink was asynchronous.

In 2011 Patrick McHardy noticed we were being silly because netlink is
synchronous and removed eff_cap from the netlink_skb_params and changed
the idiom to cap_raised(current_cap(), CAP_SYS_ADMIN).

Looking at those spots with a fresh eye we should be calling
capable(CAP_SYS_ADMIN).  The only reason I can see for not calling capable
is that it once appeared we were not in the same task as the caller which
would have made calling capable() impossible.

In the initial user_namespace the only difference between between
cap_raised(current_cap(), CAP_SYS_ADMIN) and capable(CAP_SYS_ADMIN) are a
few sanity checks and the fact that capable(CAP_SYS_ADMIN) sets
PF_SUPERPRIV if we use the capability.

Since we are going to be using root privilege setting PF_SUPERPRIV seems
the right thing to do.

The motivation for this that patch is that in a child user namespace
cap_raised(current_cap(),...) tests your capabilities with respect to that
child user namespace not capabilities in the initial user namespace and
thus will allow processes that should be unprivielged to use the kernel
services that are only protected with cap_raised(current_cap(),..).

To fix possible user_namespace issues and to just clean up the code
replace cap_raised(current_cap(), CAP_SYS_ADMIN) with
capable(CAP_SYS_ADMIN).

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-10 23:21:39 -04:00
NeilBrown b16b1b6cd0 md/bitmap: fix calculation of 'chunks' - missing shift.
commit 61a0d80c "md/bitmap: discard CHUNK_BLOCK_SHIFT macro"
replaced CHUNK_BLOCK_RATIO() by the same text that was
replacing CHUNK_BLOCK_SHIFT() - which is clearly wrong.

The result is that 'chunks' is often too small by 1,
which can sometimes result in a crash (not sure how).

So use the correct replacement, and get rid of CHUNK_BLOCK_RATIO
which is no longe used.

Reported-by: Karl Newman <siliconfiend@gmail.com>
Tested-by: Karl Newman <siliconfiend@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-04 17:03:18 +10:00
NeilBrown 30b8aa9172 md: fix possible corruption of array metadata on shutdown.
commit c744a65c1e
  md: don't set md arrays to readonly on shutdown.

removed the possibility of a 'BUG' when data is written to an array
that has just been switched to read-only, but also introduced the
possibility that the array metadata could be corrupted.

If, when md_notify_reboot gets the mddev lock, the array is
in a state where it is assembled but hasn't been started (as can
happen if the personality module is not available, or in other unusual
situations), then incorrect metadata will be written out making it
impossible to re-assemble the array.

So only call __md_stop_writes() if the array has actually been
activated.

This patch is needed for any stable kernel which has had the above
commit applied.

Cc: stable@vger.kernel.org
Reported-by: Christoph Nelles <evilazrael@evilazrael.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-24 10:23:16 +10:00
NeilBrown ed209584c3 md: don't call ->add_disk unless there is good reason.
Commit 7bfec5f35c

   md/raid5: If there is a spare and a want_replacement device, start replacement.

cause md_check_recovery to call ->add_disk much more often.
Instead of only when the array is degraded, it is now called whenever
md_check_recovery finds anything useful to do, which includes
updating the metadata for clean<->dirty transition.
This causes unnecessary work, and causes info messages from ->add_disk
to be reported much too often.

So refine md_check_recovery to only do any actual recovery checking
(including ->add_disk) if MD_RECOVERY_NEEDED is set.

This fix is suitable for 3.3.y:

Cc: stable@vger.kernel.org
Reported-by: Jan Ceuleers <jan.ceuleers@computer.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-24 10:23:14 +10:00
Jonathan Brassow a9ad8526bb DM RAID: Use safe version of rdev_for_each
Fix segfault caused by using rdev_for_each instead of rdev_for_each_safe

Commit dafb20fa34 mistakenly replaced a safe
iterator with an unsafe one when making some macro changes.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-24 10:23:13 +10:00
NeilBrown afbaa90b80 md/bitmap: prevent bitmap_daemon_work running while initialising bitmap
If a bitmap is added while the array is active, it is possible
for bitmap_daemon_work to run while the bitmap is being
initialised.
This is particularly a problem if bitmap_daemon_work sees
bitmap->filemap as non-NULL before it has been filled in properly.
So hold bitmap_info.mutex while filling in ->filemap
to prevent problems.

This patch is suitable for any -stable kernel, though it might not
apply cleanly before about 3.1.

Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-12 16:05:06 +10:00
majianpeng f4380a9158 md/raid1,raid10: Fix calculation of 'vcnt' when processing error recovery.
If r1bio->sectors % 8 != 0,then the memcmp and a later
memcpy will omit the last bio_vec.

This is suitable for any stable kernel since 3.1 when bad-block
management was introduced.

Cc: stable@vger.kernel.org
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-12 16:04:47 +10:00
Andrei Warkentin 9e41dd35b3 MD: Bitmap version cleanup.
bitmap_new_disk_sb() would still create V3 bitmap superblock
with host-endian layout.

Perhaps I'm confused, but shouldn't bitmap_new_disk_sb() be
creating a V4 bitmap superblock instead, that is portable,
as per comment in bitmap.h?

Signed-off-by: Andrei Warkentin <andrey.warkentin@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-12 15:55:21 +10:00
NeilBrown 5020ad7d14 md/raid1,raid10: don't compare excess byte during consistency check.
When comparing two pages read from different legs of a mirror, only
compare the bytes that were read, not the whole page.

In most cases we read a whole page, but in some cases with
bad blocks or odd sizes devices we might read fewer than that.

This bug has been present "forever" but at worst it might cause
a report of two many mismatches and generate a little bit
extra resync IO, so there is no need to back-port to -stable
kernels.

Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-03 15:39:23 +10:00
majianpeng c6d2e084c7 md/raid5: Fix a bug about judging if the operation is syncing or replacing
When create a raid5 using assume-clean and echo check or repair to
sync_action.Then component disks did not operated IO but the raid
check/resync faster than normal.
Because the judgement in function analyse_stripe():
		if (do_recovery ||
		    sh->sector >= conf->mddev->recovery_cp)
			s->syncing = 1;
		else
			s->replacing = 1;
When check or repair,the recovery_cp == MaxSectore,so syncing equal zero
not one.

This bug was introduced by commit 9a3e1101b8
    md/raid5:  detect and handle replacements during recovery.
so this patch is suitable for 3.3-stable.

Cc: stable@vger.kernel.org
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-03 15:37:38 +10:00
majianpeng a42f9d83b5 md/raid1:Remove unnecessary rcu_dereference(conf->mirrors[i].rdev).
Because rde->nr_pending > 0,so can not remove this disk.
And in any case, we aren't holding rcu_read_lock()

Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-03 15:37:33 +10:00
Jes Sorensen 24b961f811 md: Avoid OOPS when reshaping raid1 to raid0
raid1 arrays do not have the notion of chunk size. Calculate the
largest chunk sector size we can use to avoid a divide by zero OOPS
when aligning the size of the new array to the chunk size.

Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-03 15:37:26 +10:00
NeilBrown 18b9837ea0 md/raid5: fix handling of bad blocks during recovery.
1/ We can only treat a known-bad-block like a read-error if we
   have the data that belongs in that block.  So fix that test.

2/ If we cannot recovery a stripe due to insufficient data,
   don't tell "md_done_sync" that the sync failed unless we really
   did fail something.  If we successfully record bad blocks,
   that is success.

Reported-by: "majianpeng" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-03 15:36:17 +10:00
majianpeng 5220ea1e64 md/raid1: If md_integrity_register() failed,run() must free the mem
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-02 09:48:38 +10:00
majianpeng 0366ef8475 md/raid0: If md_integrity_register() fails, raid0_run() must free the mem.
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-02 09:48:37 +10:00
majianpeng 98d5561bfb md/linear: If md_integrity_register() fails, linear_run() must free the mem.
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-02 09:48:37 +10:00
Mikulas Patocka a4ffc15219 dm: add verity target
This device-mapper target creates a read-only device that transparently
validates the data on one underlying device against a pre-generated tree
of cryptographic checksums stored on a second device.

Two checksum device formats are supported: version 0 which is already
shipping in Chromium OS and version 1 which incorporates some
improvements.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Elly Jones <ellyjones@chromium.org>
Cc: Milan Broz <mbroz@redhat.com>
Cc: Olof Johansson <olofj@chromium.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:43:38 +01:00
Mikulas Patocka a66cc28f53 dm bufio: prefetch
This patch introduces a new function dm_bufio_prefetch. It prefetches
the specified range of blocks into dm-bufio cache without waiting
for i/o completion.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:29 +01:00
Joe Thornber 67e2e2b281 dm thin: add pool target flags to control discard
Add dm thin target arguments to control discard support.

ignore_discard: Disables discard support

no_discard_passdown: Don't pass discards down to the underlying data
device, but just remove the mapping within the thin provisioning target.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:29 +01:00
Joe Thornber 104655fd4d dm thin: support discards
Support discards in the thin target.

On discard the corresponding mapping(s) are removed from the thin
device.  If the associated block(s) are no longer shared the discard
is passed to the underlying device.

All bios other than discards now have an associated deferred_entry
that is saved to the 'all_io_entry' in endio_hook.  When non-discard
IO completes and associated mappings are quiesced any discards that
were deferred, via ds_add_work() in process_discard(), will be queued
for processing by the worker thread.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

drivers/md/dm-thin.c |  173 ++++++++++++++++++++++++++++++++++++++++++++++----
 drivers/md/dm-thin.c |  172 ++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 158 insertions(+), 14 deletions(-)
2012-03-28 18:41:28 +01:00
Joe Thornber eb2aa48d4e dm thin: prepare to support discard
This patch contains the ground work needed for dm-thin to support discard.

  - Adds endio function that replaces shared_read_endio.

  - Introduce an explicit 'quiesced' flag into the new_mapping structure.
    Before, this was implicitly indicated by m->list being empty.

  - The map_info->ptr remains constant for the duration of a bio's trip
    through the thin target.  Make it easier to reason about it.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:28 +01:00
Alasdair G Kergon 6efd6e8309 dm thin: use dm_target_offset
Use dm_target_offset wrapper instead of referencing the awkward ti->begin
explicitly.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:28 +01:00
Joe Thornber 2dd9c257fb dm thin: support read only external snapshot origins
Support the use of an external _read only_ device as an origin for a thin
device.

Any read to an unprovisioned area of the thin device will be passed
through to the origin.  Writes trigger allocation of new blocks as
usual.

One possible use case for this would be VM hosts that want to run
guests on thinly-provisioned volumes but have the base image on another
device (possibly shared between many VMs).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:28 +01:00
Mike Snitzer c4a69ecdb4 dm thin: relax hard limit on the maximum size of a metadata device
The thin metadata format can only make use of a device that is <=
THIN_METADATA_MAX_SECTORS (currently 15.9375 GB).  Therefore, there is no
practical benefit to using a larger device.

However, it may be that other factors impose a certain granularity for
the space that is allocated to a device (E.g. lvm2 can impose a coarse
granularity through the use of large, >= 1 GB, physical extents).

Rather than reject a larger metadata device, during thin-pool device
construction, switch to allowing it but issue a warning if a device
larger than THIN_METADATA_MAX_SECTORS_WARNING (16 GB) is
provided.  Any space over 15.9375 GB will not be used.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:28 +01:00
Joe Thornber 71fd5ae25d dm persistent data: remove space map ref_count entries if redundant
Save space by removing entries from the space map ref_count tree if
they're no longer needed.

Ref counts are stored in two places: a bitmap if the ref_count is
below 3, or a btree of uint32_t if 3 or above.

When a ref_count that was above 3 drops below we can remove it from
the tree and save some metadata space.  This removal was commented out
before because I was unsure why this was causing under-populated btree
nodes.  Earlier patches have fixed this issue.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:27 +01:00
Joe Thornber 905e51b39a dm thin: commit outstanding data every second
Commit unwritten data every second to prevent too much building up.

Released blocks don't become available until after the next commit
(for crash resilience).  Prior to this patch commits were only
triggered by a message to the target or a REQ_{FLUSH,FUA} bio.  This
allowed far too big a position to build up.

The interval is hard-coded to 1 second.  This is a sensible setting.
I'm not making this user configurable, since there isn't much to be
gained by tweaking this - and a lot lost by setting it far too high.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:27 +01:00
Mikulas Patocka 31998ef193 dm: reject trailing characters in sccanf input
Device mapper uses sscanf to convert arguments to numbers. The problem is that
the way we use it ignores additional unmatched characters in the scanned string.

For example, this `if (sscanf(string, "%d", &number) == 1)' will match a number,
but also it will match number with some garbage appended, like "123abc".

As a result, device mapper accepts garbage after some numbers. For example
the command `dmsetup create vg1-new --table "0 16384 linear 254:1bla 34816bla"'
will pass without an error.

This patch fixes all sscanf uses in device mapper. It appends "%c" with
a pointer to a dummy character variable to every sscanf statement.

The construct `if (sscanf(string, "%d%c", &number, &dummy) == 1)' succeeds
only if string is a null-terminated number (optionally preceded by some
whitespace characters). If there is some character appended after the number,
sscanf matches "%c", writes the character to the dummy variable and returns 2.
We check the return value for 1 and consequently reject numbers with some
garbage appended.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:26 +01:00
Jonathan E Brassow 0447568fc5 dm raid: handle failed devices during start up
The dm-raid code currently fails to create a RAID array if any of the
superblocks cannot be read.  This was an oversight as there is already
code to handle this case if the values ('- -') were provided for the
failed array position.

With this patch, if a superblock cannot be read, the array position's
fields are initialized as though '- -' was set in the table.  That is,
the device is failed and the position should not be used, but if there
is sufficient redundancy, the array should still be activated.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:26 +01:00
Joe Thornber fef838cc1a dm thin metadata: pass correct space map to dm_sm_root_size
Fix a harmless typo.

The root is a chunk of data that gets written to the superblock.  This
data is used to recreate the space map when opening a metadata area.
We have two space maps; one tracking space on the metadata device and
one of the data device.  Both of these use the same format for their
root, so this typo was harmless.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:25 +01:00
Joe Thornber a3aefb395e dm persistent data: remove redundant value_size arg from value_ptr
Now that the value_size is held within every node of the btrees we can
remove this argument from value_ptr().

For the last few months a BUG_ON has been checking this argument is
the same as that held in the node.  No issues were reported.  So this
is a safe change.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:25 +01:00
Jun'ichi Nomura 466891f995 dm mpath: detect invalid map_context
The map_context pointer should always be set. However, we have reports
that upon requeuing it is not set correctly.  So add set and clear
functions with a BUG_ON() to track the issue properly.

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:25 +01:00
Hannes Reinecke 4d7b38b7d9 dm: clear bi_end_io on remapping failure
As a precaution, set bi_end_io to NULL when failing to remap.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:25 +01:00
Hannes Reinecke 574ce07eb0 dm table: simplify call to free_devices
free_devices in dm_table.c already uses list_for_each(), so we don't
need to check if the list is empty.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:24 +01:00
Joe Thornber fe878f34df dm thin: correct comments
Remove documentation for unimplemented 'trim' message.

I'd planned a 'trim' target message for shrinking thin devices, but
this is better handled via the discard ioctl.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:24 +01:00
Alasdair G Kergon 035220b33d dm raid: no longer experimental
The dm raid module (using md) is becoming the preferred way of creating long-lived
mirrors through userspace LVM so remove the EXPERIMENTAL tag.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:24 +01:00
Alasdair G Kergon e0b215da8f dm uevent: no longer experimental
Drop EXPERIMENTAL tag from dm-uevent.

It's not changed for a while and some userspace tools are relying upon it.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:24 +01:00
Joe Thornber b0988900ba dm persistent data: fix btree rebalancing after remove
When we remove an entry from a node we sometimes rebalance with it's
two neighbours.  This wasn't being done correctly; in some cases
entries have to move all the way from the right neighbour to the left
neighbour, or vice versa.  This patch pretty much re-writes the
balancing code to fix it.

This code is barely used currently; only when you delete a thin
device, and then only if you have hundreds of them in the same pool.
Once we have discard support, which removes mappings, this will be used
much more heavily.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:23 +01:00
Joe Thornber 6f94a4c45a dm thin: fix stacked bi_next usage
Avoid using the bi_next field for the holder of a cell when deferring
bios because a stacked device below might change it.  Store the
holder in a new field in struct cell instead.

When a cell is created, the bio that triggered creation (the holder) was
added to the same bio list as subsequent bios.  In some cases we pass
this holder bio directly to devices underneath.  If those devices use
the bi_next field there will be trouble...

This also simplifies some code that had to work out which bio was the
holder.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:23 +01:00
Mikulas Patocka 72c6e7afc4 dm crypt: add missing error handling
Always set io->error to -EIO when an error is detected in dm-crypt.

There were cases where an error code would be set only if we finish
processing the last sector. If there were other encryption operations in
flight, the error would be ignored and bio would be returned with
success as if no error happened.

This bug is present in kcryptd_crypt_write_convert, kcryptd_crypt_read_convert
and kcryptd_async_done.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@kernel.org
Reviewed-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:22 +01:00
Mikulas Patocka aeb2deae26 dm crypt: fix mempool deadlock
This patch fixes a possible deadlock in dm-crypt's mempool use.

Currently, dm-crypt reserves a mempool of MIN_BIO_PAGES reserved pages.
It allocates first MIN_BIO_PAGES with non-failing allocation (the allocation
cannot fail and waits until the mempool is refilled). Further pages are
allocated with different gfp flags that allow failing.

Because allocations may be done in parallel, this code can deadlock. Example:
There are two processes, each tries to allocate MIN_BIO_PAGES and the processes
run simultaneously.
It may end up in a situation where each process allocates (MIN_BIO_PAGES / 2)
pages. The mempool is exhausted. Each process waits for more pages to be freed
to the mempool, which never happens.

To avoid this deadlock scenario, this patch changes the code so that only
the first page is allocated with non-failing gfp mask. Allocation of further
pages may fail.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:22 +01:00
Andrei Warkentin aadbe266f2 dm exception store: fix init error path
Call the correct exit function on failure in dm_exception_store_init.

Signed-off-by: Andrei Warkentin <andrey.warkentin@gmail.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:22 +01:00
Linus Torvalds 267d7b23dd md updates for 3.4
Mostly tidying up code in preparation for some bigger changes
 next time.
 A few bug fixes tagged for -stable.
 
 Main functionality change is that some RAID10 arrays can now
 grow to use extra space that may have been made available on the
 individual devices.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIVAwUAT2bLBjnsnt1WYoG5AQKN3xAAv1UlR5Kem5WN7Ex4lmR9xj3lr9dbURYT
 TtvrUuCy3pYYWdTuijb+IBqkbODF0kPDHIhUiBx9fXUfMavkp/b9heXS/vJ3pcH4
 1j99NUbOGL/AylD1TPRV9TQxGTKhEjK3n26bY0t/amLc92bWJaytMO1B9cz38LN+
 qx6ufpIepz4DPXXtPYpnkBR4cZ6L4/ZXQvjf5BqG6WfKwc+0Nyncg8ipYEqhBWy7
 R7ztF5yPo0yl96Wopa2KG91OroWflmyZo1DNYcbUbKtbNGGtYC92GFadOH+wNupM
 FnmXv10ivfVGU5w4SpshAwOg+4OSUqmWNsBxUhpYbf8ChbN+lOl0VZdH6UBxo19D
 3SqZWT/yz4I4HYd5rtr35MXFdOeBNM++CHQs4F68BLA0B6OcHfWsA9bvly2tnBVx
 iEBFPd277qWztUr8m6yz7AFf/0dgyXuIhuB3d7IkVrG5yG3FX6hPi2T0FSA33qMx
 Lwi5w6O4DREg5tG09xEYEnXgXe+PnB8HsKb1U/m76XMQ0UScvX6dLA6934Vg+DCv
 xf+AYqob0Tc/Op5I7h2PbVXq7DciNXwlX1WvM0m+TEaV+3fl1FB0VsCcANAV6JVn
 uRLmvtePQRt0hxAog2p7OsumVnxMhbuEo5h8rJMKWM7IbhueKNoz+gBwpcFLzBmY
 ygWc4peLQpE=
 =MGuM
 -----END PGP SIGNATURE-----

Merge tag 'md-3.4' of git://neil.brown.name/md

Pull md updates for 3.4 from Neil Brown:
 "Mostly tidying up code in preparation for some bigger changes next
  time.

  A few bug fixes tagged for -stable.

  Main functionality change is that some RAID10 arrays can now grow to
  use extra space that may have been made available on the individual
  devices."

Fixed up trivial conflicts with the k[un]map_atomic() cleanups in
drivers/md/bitmap.c.

* tag 'md-3.4' of git://neil.brown.name/md: (22 commits)
  md: Add judgement bb->unacked_exist in function md_ack_all_badblocks().
  md: fix clearing of the 'changed' flags for the bad blocks list.
  md/bitmap: discard CHUNK_BLOCK_SHIFT macro
  md/bitmap: remove unnecessary indirection when allocating.
  md/bitmap: remove some pointless locking.
  md/bitmap: change a 'goto' to a normal 'if' construct.
  md/bitmap: move printing of bitmap status to bitmap.c
  md/bitmap: remove some unused noise from bitmap.h
  md/raid10 - support resizing some RAID10 arrays.
  md/raid1: handle merge_bvec_fn in member devices.
  md/raid10: handle merge_bvec_fn in member devices.
  md: add proper merge_bvec handling to RAID0 and Linear.
  md: tidy up rdev_for_each usage.
  md/raid1,raid10: avoid deadlock during resync/recovery.
  md/bitmap: ensure to load bitmap when creating via sysfs.
  md: don't set md arrays to readonly on shutdown.
  md: allow re-add to failed arrays.
  md/raid5: use atomic_dec_return() instead of atomic_dec() and atomic_read().
  md: Use existed macros instead of numbers
  md/raid5: removed unused 'added_devices' variable.
  ...
2012-03-22 12:29:50 -07:00
Linus Torvalds 9f3938346a Merge branch 'kmap_atomic' of git://github.com/congwang/linux
Pull kmap_atomic cleanup from Cong Wang.

It's been in -next for a long time, and it gets rid of the (no longer
used) second argument to k[un]map_atomic().

Fix up a few trivial conflicts in various drivers, and do an "evil
merge" to catch some new uses that have come in since Cong's tree.

* 'kmap_atomic' of git://github.com/congwang/linux: (59 commits)
  feature-removal-schedule.txt: schedule the deprecated form of kmap_atomic() for removal
  highmem: kill all __kmap_atomic() [swarren@nvidia.com: highmem: Fix ARM build break due to __kmap_atomic rename]
  drbd: remove the second argument of k[un]map_atomic()
  zcache: remove the second argument of k[un]map_atomic()
  gma500: remove the second argument of k[un]map_atomic()
  dm: remove the second argument of k[un]map_atomic()
  tomoyo: remove the second argument of k[un]map_atomic()
  sunrpc: remove the second argument of k[un]map_atomic()
  rds: remove the second argument of k[un]map_atomic()
  net: remove the second argument of k[un]map_atomic()
  mm: remove the second argument of k[un]map_atomic()
  lib: remove the second argument of k[un]map_atomic()
  power: remove the second argument of k[un]map_atomic()
  kdb: remove the second argument of k[un]map_atomic()
  udf: remove the second argument of k[un]map_atomic()
  ubifs: remove the second argument of k[un]map_atomic()
  squashfs: remove the second argument of k[un]map_atomic()
  reiserfs: remove the second argument of k[un]map_atomic()
  ocfs2: remove the second argument of k[un]map_atomic()
  ntfs: remove the second argument of k[un]map_atomic()
  ...
2012-03-21 09:40:26 -07:00
Linus Torvalds 69a7aebcf0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree from Jiri Kosina:
 "It's indeed trivial -- mostly documentation updates and a bunch of
  typo fixes from Masanari.

  There are also several linux/version.h include removals from Jesper."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (101 commits)
  kcore: fix spelling in read_kcore() comment
  constify struct pci_dev * in obvious cases
  Revert "char: Fix typo in viotape.c"
  init: fix wording error in mm_init comment
  usb: gadget: Kconfig: fix typo for 'different'
  Revert "power, max8998: Include linux/module.h just once in drivers/power/max8998_charger.c"
  writeback: fix fn name in writeback_inodes_sb_nr_if_idle() comment header
  writeback: fix typo in the writeback_control comment
  Documentation: Fix multiple typo in Documentation
  tpm_tis: fix tis_lock with respect to RCU
  Revert "media: Fix typo in mixer_drv.c and hdmi_drv.c"
  Doc: Update numastat.txt
  qla4xxx: Add missing spaces to error messages
  compiler.h: Fix typo
  security: struct security_operations kerneldoc fix
  Documentation: broken URL in libata.tmpl
  Documentation: broken URL in filesystems.tmpl
  mtd: simplify return logic in do_map_probe()
  mm: fix comment typo of truncate_inode_pages_range
  power: bq27x00: Fix typos in comment
  ...
2012-03-20 21:12:50 -07:00
Cong Wang c2e022cb65 dm: remove the second argument of k[un]map_atomic()
Acked-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-03-20 21:48:28 +08:00
Cong Wang b2f46e6882 md: remove the second argument of k[un]map_atomic()
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-03-20 21:48:18 +08:00
majianpeng ecb178bb2b md: Add judgement bb->unacked_exist in function md_ack_all_badblocks().
If there are no unacked bad blocks, then there is no point searching
for them to acknowledge them.


Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:42 +11:00
NeilBrown d0962936bf md: fix clearing of the 'changed' flags for the bad blocks list.
In super_1_sync (the first hunk) we need to clear 'changed' before
checking read_seqretry(), otherwise we might race with other code
adding a bad block and so won't retry later.

In md_update_sb (the second hunk), in the case where there is no
metadata (neither persistent nor external), we treat any bad blocks as
an error.  However we need to clear the 'changed' flag before calling
md_ack_all_badblocks, else it won't do anything.

This patch is suitable for -stable release 3.0 and later.

Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:41 +11:00
NeilBrown 61a0d80ce4 md/bitmap: discard CHUNK_BLOCK_SHIFT macro
Be redefining ->chunkshift as the shift from sectors to chunks rather
than bytes to chunks, we can just use "bitmap->chunkshift" which is
shorter than the macro call, and less indirect.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:41 +11:00
NeilBrown 792a1d4bbf md/bitmap: remove unnecessary indirection when allocating.
These funcitons don't add anything useful except possibly the trace
points, and I don't think they are worth the extra indirection.
So remove them.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:41 +11:00
NeilBrown 5a6c824ebb md/bitmap: remove some pointless locking.
There is nothing gained by holding a lock while we check if a pointer
is NULL or not.  If there could be a race, then it could become NULL
immediately after the unlock - but there is no race here.

So just remove the locking.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:40 +11:00
NeilBrown 278c1ca2f2 md/bitmap: change a 'goto' to a normal 'if' construct.
The use of a goto makes the control flow more obscure here.

So make it a normal:
  if (x) {
     Y;
  }

No functional change.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:40 +11:00
NeilBrown 57148964d9 md/bitmap: move printing of bitmap status to bitmap.c
The part of /proc/mdstat which describes the bitmap should really
be generated by code in bitmap.c.  So move it there.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:40 +11:00
NeilBrown 4ba97dff71 md/bitmap: remove some unused noise from bitmap.h
Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:40 +11:00
NeilBrown 006a09a0ae md/raid10 - support resizing some RAID10 arrays.
'resizing' an array in this context means making use of extra
space that has become available in component devices, not adding new
devices.
It also includes shrinking the array to take up less space of
component devices.

This is not supported for array with a 'far' layout.  However
for 'near' and 'offset' layout arrays, adding and removing space at
the end of the devices is easy to support, and this patch provides
that support.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:40 +11:00
NeilBrown 6b740b8d79 md/raid1: handle merge_bvec_fn in member devices.
Currently we don't honour merge_bvec_fn in member devices so if there
is one, we force all requests to be single-page at most.
This is not ideal.

So create a raid1 merge_bvec_fn to check that function in children
as well.

This introduces a small problem.  There is no locking around calls
the ->merge_bvec_fn and subsequent calls to ->make_request.  So a
device added between these could end up getting a request which
violates its merge_bvec_fn.

Currently the best we can do is synchronize_sched().  This will work
providing no preemption happens.  If there is is preemption, we just
have to hope that new devices are largely consistent with old devices.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:39 +11:00
NeilBrown 050b66152f md/raid10: handle merge_bvec_fn in member devices.
Currently we don't honour merge_bvec_fn in member devices so if there
is one, we force all requests to be single-page at most.
This is not ideal.

So enhance the raid10 merge_bvec_fn to check that function in children
as well.

This introduces a small problem.  There is no locking around calls
the ->merge_bvec_fn and subsequent calls to ->make_request.  So a
device added between these could end up getting a request which
violates its merge_bvec_fn.

Currently the best we can do is synchronize_sched().  This will work
providing no preemption happens.  If there is preemption, we just
have to hope that new devices are largely consistent with old devices.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:39 +11:00
NeilBrown ba13da47ff md: add proper merge_bvec handling to RAID0 and Linear.
These personalities currently set a max request size of one page
when any member device has a merge_bvec_fn because they don't
bother to call that function.

This causes extra works in splitting and combining requests.

So make the extra effort to call the merge_bvec_fn when it exists
so that we end up with larger requests out the bottom.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:39 +11:00
NeilBrown dafb20fa34 md: tidy up rdev_for_each usage.
md.h has an 'rdev_for_each()' macro for iterating the rdevs in an
mddev.  However it uses the 'safe' version of list_for_each_entry,
and so requires the extra variable, but doesn't include 'safe' in the
name, which is useful documentation.

Consequently some places use this safe version without needing it, and
many use an explicity list_for_each entry.

So:
 - rename rdev_for_each to rdev_for_each_safe
 - create a new rdev_for_each which uses the plain
   list_for_each_entry,
 - use the 'safe' version only where needed, and convert all other
   list_for_each_entry calls to use rdev_for_each.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:39 +11:00
NeilBrown d6b42dcb99 md/raid1,raid10: avoid deadlock during resync/recovery.
If RAID1 or RAID10 is used under LVM or some other stacking
block device, it is possible to enter a deadlock during
resync or recovery.
This can happen if the upper level block device creates
two requests to the RAID1 or RAID10.  The first request gets
processed, blocks recovery and queue requests for underlying
requests in current->bio_list.  A resync request then starts
which will wait for those requests and block new IO.

But then the second request to the RAID1/10 will be attempted
and it cannot progress until the resync request completes,
which cannot progress until the underlying device requests complete,
which are on a queue behind that second request.

So allow that second request to proceed even though there is
a resync request about to start.

This is suitable for any -stable kernel.

Cc: stable@vger.kernel.org
Reported-by: Ray Morris <support@bettercgi.com>
Tested-by: Ray Morris <support@bettercgi.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:38 +11:00
NeilBrown 4474ca42e2 md/bitmap: ensure to load bitmap when creating via sysfs.
When commit 69e51b449d (md/bitmap:  separate out loading a bitmap...)
created bitmap_load, it missed calling it after bitmap_create when a
bitmap is created through the sysfs interface.
So if a bitmap is added this way, we don't allocate memory properly
and can crash.

This is suitable for any -stable release since 2.6.35.
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:37 +11:00
NeilBrown c744a65c1e md: don't set md arrays to readonly on shutdown.
It seems that with recent kernel, writeback can still be happening
while shutdown is happening, and consequently data can be written
after the md reboot notifier switches all arrays to read-only.
This causes a BUG.

So don't switch them to read-only - just mark them clean and
set 'safemode' to '2' which mean that immediately after any
write the array will be switch back to 'clean'.

This could result in the shutdown happening when array is marked
dirty, thus forcing a resync on reboot.  However if you reboot
without performing a "sync" first, you get to keep both halves.

This is suitable for any stable kernel (though there might be some
conflicts with obvious fixes in earlier kernels).

Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:37 +11:00
NeilBrown dc10c643e8 md: allow re-add to failed arrays.
When an array is failed (some data inaccessible) then there is no
point attempting to add a spare as it could not possibly be recovered.

However that may be value in re-adding a recently removed device.
e.g. if there is a write-intent-bitmap and it is clear, then access
to the data could be restored by this action.

So don't reject a re-add to a failed array for RAID10 and RAID5 (the
only arrays  types that check for a failed array).

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:37 +11:00
majianpeng 41fe75f60b md/raid5: use atomic_dec_return() instead of atomic_dec() and atomic_read().
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-13 11:21:25 +11:00
NeilBrown 9d4c7d8799 md/raid5: removed unused 'added_devices' variable.
commit 908f4fbd26 removed the last user of this variable,
so we should discard it completely.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-13 11:21:21 +11:00
NeilBrown 547414d19f md/raid10: remove unnecessary smp_mb() from end_sync_write
Recent commit 4ca40c2ce0 (md/raid10: Allow replacement device ...)
added an smp_mb in end_sync_write.
This was to close a possible race with raid10_remove_disk.
However there is no such race as it is never attempted to remove a
disk while resync (or recovery) is happening.
so the smp_mb is just noise.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-13 11:21:20 +11:00
NeilBrown 1e3fa9bd50 md/raid5: make sure reshape_position is cleared on error path.
Leaving a valid reshape_position value in place could be confusing.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-13 11:21:18 +11:00
Linus Torvalds 5d0edf2915 Device-mapper fixes for 3.3.
Eight small device-mapper bug fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJPV8+yAAoJEK2W1qbAHj1nZVAQAI8TNKwnBpKSW3Y9XFHqWjEx
 71wbjDkKkdEUWy52CAkSoRnQdX+ABxxGr5R60n/vJvHi4yDse56LddPzKAo4zD3c
 DVh6RB8CTIY+2IXGzjkDtelmKogKyAMlhmRoj0oLb5/29n6lnn6A0vkq4OimuFJO
 IIdgJxpRLqmV8NcSVC7qCEoErxzTNz9w7HaBBs73VhF8AcN/6Qi/z55zDOzT/Iz8
 iMHGmOHJBb8OxMN8BWWFdDh2YUz3isbM1xbBerYxy3P3WCHpxGBt7yRiHm3Yd5il
 USnJN3Kz0w6Orhgu1eeAuJz1A9cdSP62AQDdM91+v3nHz3mtTdAljmJZgzgzqs5u
 SRO24J6FD201DNh/RitDC1UzNOBqeapfqprT/gH+qM4Pl6X+vuXiSe5cxx+lTOhJ
 GErI1XYpTfzymdpQfqj6VnDMevRf0Hz+mSjEiUh8qjUv9bXHkmTrzjxCvAIEM+4h
 fJSQ0Fp77eV7Du9HkkFbEXVTYOe8VO+6E9AaplBAjZxHS6w+5tMFkHTM28JPxS98
 rYAks9QKbaZaEYZiNv7htux8n2OS9IeGHdLQpsooLh6lD4GxvBJ7NC8wUkfUzn27
 zEr2vqAYuA3PiccSHnT7tlN0PN1JlOjDCf+cdQkKfJj5w0E/qS2Fiv2UFIRLRPEa
 blSbf7wU0mpvorQJn/bd
 =lLJB
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm

Pull device-mapper fixes for 3.3 from Alasdair Kergon

Eight small device-mapper bug fixes.

* tag 'dm-3.3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
  dm raid: fix flush support
  dm raid: set MD_CHANGE_DEVS when rebuilding
  dm thin metadata: decrement counter after removing mapped block
  dm thin metadata: unlock superblock in init_pmd error path
  dm thin metadata: remove incorrect close_device on creation error paths
  dm flakey: fix crash on read when corrupt_bio_byte not set
  dm io: fix discard support
  dm ioctl: do not leak argv if target message only contains whitespace
2012-03-08 17:21:51 -08:00
Jonathan E Brassow 0ca93de9b7 dm raid: fix flush support
Fix dm-raid flush support.

Both md and dm have support for flush, but the dm-raid target
forgot to set the flag to indicate that flushes should be
passed on.  (Important for data integrity e.g. with writeback cache
enabled.)

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:48 +00:00
Jonathan E Brassow 3aa3b2b2b1 dm raid: set MD_CHANGE_DEVS when rebuilding
The 'rebuild' parameter is used to rebuild individual devices in an
array (e.g. resynchronize a RAID1 device or recalculate a parity device
in higher RAID).  The MD_CHANGE_DEVS flag must be set when this
parameter is given in order to write out the superblocks and make the
change take immediate effect.  The code that handles new devices in
super_load already sets MD_CHANGE_DEVS and 'FirstUse'.  (The 'FirstUse'
flag was being set as a special case for rebuilds in
super_init_validation.)

Add a condition for rebuilds in super_load to take care of both flags
without the special case in 'super_init_validation'.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:47 +00:00
Joe Thornber af63bcb817 dm thin metadata: decrement counter after removing mapped block
Correct the number of mapped sectors shown on a thin device's
status line by decrementing td->mapped_blocks in __remove() each time
a block is removed.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:44 +00:00
Joe Thornber 4469a5f387 dm thin metadata: unlock superblock in init_pmd error path
If dm_sm_disk_create() fails the superblock must be unlocked.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:43 +00:00
Mike Snitzer 1f3db25d8b dm thin metadata: remove incorrect close_device on creation error paths
The __open_device() error paths in __create_thin() and __create_snap()
incorrectly call __close_device() even if td was not initialized by
__open_device().  Remove this.

Also document __open_device() return values, remove a redundant
td->changed = 1 in __create_thin(), and insert an additional
safeguard against creating an already-existing device.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:41 +00:00
Mike Snitzer 1212268fd9 dm flakey: fix crash on read when corrupt_bio_byte not set
The following BUG is hit on the first read that is submitted to a dm
flakey test device while the device is "down" if the corrupt_bio_byte
feature wasn't requested when the device's table was loaded.

Example DM table that will hit this BUG:
0 2097152 flakey 8:0 2048 0 30

This bug was introduced by commit a3998799fb
(dm flakey: add corrupt_bio_byte feature) in v3.1-rc1.

BUG: unable to handle kernel paging request at ffff8801cfce3fff
IP: [<ffffffffa008c233>] corrupt_bio_data+0x6e/0xae [dm_flakey]
PGD 1606063 PUD 0
Oops: 0002 [#1] SMP
...
Call Trace:
 <IRQ>
 [<ffffffffa008c2b5>] flakey_end_io+0x42/0x48 [dm_flakey]
 [<ffffffffa00dca98>] clone_endio+0x54/0xb6 [dm_mod]
 [<ffffffff81130587>] bio_endio+0x2d/0x2f
 [<ffffffff811c819a>] req_bio_endio+0x96/0x9f
 [<ffffffff811c94b9>] blk_update_request+0x1dc/0x3a9
 [<ffffffff812f5ee2>] ? rcu_read_unlock+0x21/0x23
 [<ffffffff811c96a6>] blk_update_bidi_request+0x20/0x6e
 [<ffffffff811c9713>] blk_end_bidi_request+0x1f/0x5d
 [<ffffffff811c978d>] blk_end_request+0x10/0x12
 [<ffffffff8128f450>] scsi_io_completion+0x1e5/0x4b1
 [<ffffffff812882a9>] scsi_finish_command+0xec/0xf5
 [<ffffffff8128f830>] scsi_softirq_done+0xff/0x108
 [<ffffffff811ce284>] blk_done_softirq+0x84/0x98
 [<ffffffff81048d19>] __do_softirq+0xe3/0x1d5
 [<ffffffff8138f83f>] ? _raw_spin_lock+0x62/0x69
 [<ffffffff810997cf>] ? handle_irq_event+0x4c/0x61
 [<ffffffff8139833c>] call_softirq+0x1c/0x30
 [<ffffffff81003b37>] do_softirq+0x4b/0xa3
 [<ffffffff81048a39>] irq_exit+0x53/0xca
 [<ffffffff81398acd>] do_IRQ+0x9d/0xb4
 [<ffffffff81390333>] common_interrupt+0x73/0x73
...

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.1+
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:39 +00:00
Milan Broz 0c535e0d6f dm io: fix discard support
This patch fixes a crash by recognising discards in dm_io.

Currently dm_mirror can send REQ_DISCARD bios if running over a
discard-enabled device and without support in dm_io the system
crashes badly.

BUG: unable to handle kernel paging request at 00800000
IP:  __bio_add_page.part.17+0xf5/0x1e0
...
 bio_add_page+0x56/0x70
 dispatch_io+0x1cf/0x240 [dm_mod]
 ? km_get_page+0x50/0x50 [dm_mod]
 ? vm_next_page+0x20/0x20 [dm_mod]
 ? mirror_flush+0x130/0x130 [dm_mirror]
 dm_io+0xdc/0x2b0 [dm_mod]
...

Introduced in 2.6.38-rc1 by commit 5fc2ffeabb
(dm raid1: support discard).

Signed-off-by: Milan Broz <mbroz@redhat.com>
Cc: stable@kernel.org
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:37 +00:00
Jesper Juhl 902c6a96a7 dm ioctl: do not leak argv if target message only contains whitespace
If 'argc' is zero we jump to the 'out:' label, but this leaks the
(unused) memory that 'dm_split_args()' allocated for 'argv' if the
string being split consisted entirely of whitespace.  Jump to the
'out_argv:' label instead to free up that memory.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:34 +00:00
Linus Torvalds a2e5f13ce8 3 fixes for md in 3.3-rc
2 relate to the recently added drive replacement.
 
 One causes read error in RAID10 to sometimes be retried indefinitely.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIVAwUAT1VI1znsnt1WYoG5AQK47Q//d51y5QCpABFNUcgIM626zJXlBWFUSmzU
 wFOGXh5emN6/TWguzkiZwrvcspDmXMzz1zmJtGWixYb2jBpn2MHEN4uNz3Vq68w+
 IYk/dJg/CG4+lzX+6IjiHOb3+TASRx94QZHJASx68vypqniAyikshqcbUeZBMTB0
 Fu+sKqsOGYmwQfe6/vtRPVXY7DYK2dFDBRMFpmOl+o4Y2XxmmWzMw4Dg1RIEdtFS
 Jo9GwLHTnlw2xoc0XooufeT0Q2KOpqi9T8L6Nj0ORwpgsFqgtZ/kIOoGU6qOpSri
 ofLTrobVKMpjFtmiYVOp9TaBlPnd/TNX3E4WPLGNsAwYuRUFjq8evmJKjG+pOdeB
 3ArxRKRJCaI2jnVhH+NpT7i/tpkEg/8a/BoOAihX+hM/8QkmsWluaRBOGMhpuuuc
 1baPVTusi/zijO9cM8RGIXaQj5UG4s3LUpCIOIYdDyxsfmAH5KN1F2EPrU4NMME2
 96THSshIZLkgAg5ICwtva0qoHlBlEclAlVAzEomT7R9KwHojEB1xUiyMmaIdMFoy
 JjGFAMp2E5+KBKZ1eYEHjthPWCb+nZ3eYHUh0DOnEt4kASCXnn45GJREQkpkNIR/
 HhDTS8vI743unKnbCtYFMxiw/9OXZbMkdoZhobg7lxcpoQlWJ+5ziOtACl0h0Kv8
 +ET+Kp3W8K4=
 =93ms
 -----END PGP SIGNATURE-----

Merge tag 'md-3.3-fixes' of git://neil.brown.name/md

Pull md fixes from Neil Brown:
 "Three fixes for md in 3.3-rc: Two relate to the recently added drive
  replacement.  One fixes the problem where a read error in RAID10 would
  sometimes be retried indefinitely."

* tag 'md-3.3-fixes' of git://neil.brown.name/md:
  md/raid10: fix assembling of arrays with replacement devices.
  md/raid10: fix handling of error on last working device in array.
  md/raid1: fix buglet in md_raid1_contested.
2012-03-05 16:01:25 -08:00
NeilBrown 7a90484825 md/raid10: fix assembling of arrays with replacement devices.
commit 56a2559bb6 (md/raid10: recognise replacements ...)
changed 'run' to set ->replacement or ->rdev depending on the
'Replacement' status if the device, but it didn't remove the
old unconditional setting of 'rdev'.  So it was largely ineffective.

So remove that now.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-06 10:12:45 +11:00
NeilBrown fae8cc5ed0 md/raid10: fix handling of error on last working device in array.
If we get a read error on the last working device in a RAID10 which
contains the target block, then we don't fail the device (which is
good) but we don't abort retries, which is wrong.
We end up in an infinite loop retrying the read on the one device.

This patch fixes the problem in two places:
1/ in raid10_end_read_request we don't even ask for a retry if this
   was the last usable device.  This is efficient but a little racy
   and will sometimes retry when it should not.

2/ in handle_read_error we are careful to exclude any device from
   retry which we tried to mark as faulty (that might have failed if
   it was the last device).  This is race-free but less efficient.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-02-14 11:10:10 +11:00
NeilBrown f53e29fc87 md/raid1: fix buglet in md_raid1_contested.
Since we added 'replacement' capability, RAID1 can have twice
as many devices as ->raid_disks indicates.
So md_raid1_congested needs to check that many possible devices,
not just ->raid_disks many.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-02-13 14:24:05 +11:00
Linus Torvalds 4d39aa1b99 Some simple md-related fixes.
1/ two small fixes to ensure we handle an interrupted resync properly.
 2/ avoid loading the bitmap multiple times in dm-raid
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIVAwUATzMdiTnsnt1WYoG5AQKICw/9H3Xf/3crCCVRQ+yzSdZ1ZJH24Rps9O6W
 8dLFN4/Ng/qxymWUMrgHAMq5MEEz2M3i7W+j23lFv6Oce06y8GJ4PpoYY5xlXCgO
 SIU1BaO1JFHxQn89EQtP3iOn4AOiZvX0GUObR0P8KO1mMnLmN7cg8J1kBfmQiBKu
 aXcUqqNvcywoix6ve4O/xgnZjd4IExxqG3W8U7CaIwExUDwaLY4NckxJcIJbIYy9
 iapOGMUdcyr6xm819V/xE2DyAtfFCtvAk1hfW/dM4QQctran3MzQIRFn9RW+CwHU
 ComEnv5ti/7g//JPXQArUPk4xgRHrMhqFcmmD8rozJ6FJDi8vw2e0BXaRLVqa0mK
 1qSZkr0Ot3nwAdILzgSbNXQ0Y5OJgc9OLX5GGlVibTW2VTJYFgA7jAsnqq8PAJC5
 sU5h2K3jrSy2unGy6BxleL5D/wvREE5OBnW35TEB5TYbxjp1FLgn+BWp8FfFUYWT
 Eb2cIyAj6cBFJ3ma1K0RH0dmS9cbNjuG+CLiApJOnEEsXzrp/4KnqOwg4672ewW3
 m1Ue2Qv+0avaK3sVyT+qzuemc6b0ps/dix0gMXw2pYqXQWHquW5NdUJcgD2DKFSn
 BB734nUP6KlPg0IFh1eehRHyVRLIAot/uBlUJ3bMx9xeYCkKa+twX90u6EmjTopP
 JjLxNsf6c2I=
 =k0Xz
 -----END PGP SIGNATURE-----

Merge tag 'md-3.3-fixes' of git://neil.brown.name/md

Some simple md-related fixes.

1/ two small fixes to ensure we handle an interrupted resync properly.
2/ avoid loading the bitmap multiple times in dm-raid

* tag 'md-3.3-fixes' of git://neil.brown.name/md:
  md: two small fixes to handling interrupt resync.
  Prevent DM RAID from loading bitmap twice.
2012-02-08 19:06:30 -08:00
NeilBrown db91ff55bd md: two small fixes to handling interrupt resync.
1/ If a resync is aborted we should record how far we got
 (recovery_cp) the last request that we know has completed
 (->curr_resync_completed) rather than the last request that was
 submitted (->curr_resync).

2/ When a resync aborts we still want to update the metadata with
 any changes, so set MD_CHANGE_DEVS even if we 'skip'.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-02-07 12:01:51 +11:00
Jiri Kosina 972c5ae961 Merge branch 'master' into for-next
Sync with Linus' tree to be able to apply patch to a newer
code (namely drivers/gpu/drm/gma500/psb_intel_lvds.c)
2012-02-03 23:13:05 +01:00
Jesper Juhl ad075370ba dm-bufio.c: there's no need to include linux/version.h
As 'make versioncheck' points out, drivers/md/dm-bufio.c has no need to include
linux/version.h, so this patch removes the unneeded include.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-02-03 22:38:12 +01:00
Jonathan Brassow 34f8ac6d79 Prevent DM RAID from loading bitmap twice.
The life cycle of a device-mapper target is:
1) create
2) resume
3) suspend
*) possibly repeat from 2
4) destroy

The dm-raid target is unconditionally calling MD's bitmap_load function upon
every resume.  If steps 2 & 3 above are repeated, bitmap_load is called
multiple times.  It is only written to be called once; otherwise, it allocates
new memory for the bitmap (without freeing the old) and incrementing the number
of pages it thinks it has without zeroing first.  This ultimately leads to
access beyond allocated memory and lost memory.

Simply avoiding the bitmap_load call upon resume is not sufficient.  If the
target was suspended while the initial recovery was only partially complete,
it needs to be restarted when the target is resumed.  This is why
'md_wakeup_thread' is called before issuing the 'mddev_resume'.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-01-31 09:43:41 +11:00
Linus Torvalds b3c9dd182e Merge branch 'for-3.3/core' of git://git.kernel.dk/linux-block
* 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits)
  Revert "block: recursive merge requests"
  block: Stop using macro stubs for the bio data integrity calls
  blockdev: convert some macros to static inlines
  fs: remove unneeded plug in mpage_readpages()
  block: Add BLKROTATIONAL ioctl
  block: Introduce blk_set_stacking_limits function
  block: remove WARN_ON_ONCE() in exit_io_context()
  block: an exiting task should be allowed to create io_context
  block: ioc_cgroup_changed() needs to be exported
  block: recursive merge requests
  block, cfq: fix empty queue crash caused by request merge
  block, cfq: move icq creation and rq->elv.icq association to block core
  block, cfq: restructure io_cq creation path for io_context interface cleanup
  block, cfq: move io_cq exit/release to blk-ioc.c
  block, cfq: move icq cache management to block core
  block, cfq: move io_cq lookup to blk-ioc.c
  block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq
  block, cfq: reorganize cfq_io_context into generic and cfq specific parts
  block: remove elevator_queue->ops
  block: reorder elevator switch sequence
  ...

Fix up conflicts in:
 - block/blk-cgroup.c
	Switch from can_attach_task to can_attach
 - block/cfq-iosched.c
	conflict with now removed cic index changes (we now use q->id instead)
2012-01-15 12:24:45 -08:00
Paolo Bonzini ec8013bedd dm: do not forward ioctls from logical volumes to the underlying device
A logical volume can map to just part of underlying physical volume.
In this case, it must be treated like a partition.

Based on a patch from Alasdair G Kergon.

Cc: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-14 15:07:24 -08:00
Linus Torvalds c086ae4ed9 Two bugfixes for md.
One is a recently introduced regression that affects an unusual
 configuration with a guaranteed BUG_ON.  Has been tagged for -stable.
 The other is minor missing functionality.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIVAwUATwy6Sznsnt1WYoG5AQL+5A//TbTgElZaJ7IMY4q658afuRNtuWfevqTs
 4EoSUvarwyZN20JxUd4dFTzLQ3nu3XVmwZsDBbpRs7+Dt2m7Efp4qytqrTxHb6SR
 4gOr1KFXZi2rQFNpIg8T5+eyb+2VkbHGYffOtwS9TZnJqZZ4upffJi1EpJSfB1Bo
 ilkO8wcaNKVWzTgnQo+JVOLQQyNENs12Xc0aLVA0dZC0a37qWJTbr75r7nrtLT7A
 Gy783AG8JglRsr7AOVceqBVOpRonhFDz7G2hQqHg140m6i/GzDJrPtadovCtq7nt
 U6/Po7qbOj5eOSGrVPwS1gJQOT7deAL7Eeu7dOpbzl1Cwysbhg63piMNyDs4P/gM
 bFsR+LTbmZiaYs5G1oDwN/WTYLeq6cxY0IftShWdGoQwZRF/woJ7VAQSWNvHY8mg
 Z+EbEL3sY40+8eBk7/umT0WxQ9wYjooS/9ZowQ2ktRmt82Dwv0LXzWNTSlwhWKKt
 QBtv1er/psEKFqb2zDtlea8gDlKahaVNaiOK6RuY5CM5iBa4/zEmWVXS/i07LC7Z
 cW9swD4J3AEKSolWHWYQJBmCsKy+rUp5t0mQ5e/O4+nhCDbfe+Da0OArg6b/ygMu
 14RdyjOENxSqKi3IkCnToch+eNzCIm3ETaS2E0nSv996G+ShqsLtROOI9x9DXiu3
 nyLxAnIVp8I=
 =969y
 -----END PGP SIGNATURE-----

Merge tag 'md-3.3-fixes' of git://neil.brown.name/md

Two bugfixes for md.

One is a recently introduced regression that affects an unusual
configuration with a guaranteed BUG_ON.  Has been tagged for -stable.
The other is minor missing functionality.

* tag 'md-3.3-fixes' of git://neil.brown.name/md:
  md/raid1: perform bad-block tests for WriteMostly devices too.
  md: notify the 'degraded' sysfs attribute on failure.
2012-01-11 18:51:55 -08:00
Martin K. Petersen b1bd055d39 block: Introduce blk_set_stacking_limits function
Stacking driver queue limits are typically bounded exclusively by the
capabilities of the low level devices, not by the stacking driver
itself.

This patch introduces blk_set_stacking_limits() which has more liberal
metrics than the default queue limits function. This allows us to
inherit topology parameters from bottom devices without manually
tweaking the default limits in each driver prior to calling the stacking
function.

Since there is now a clear distinction between stacking and low-level
devices, blk_set_default_limits() has been modified to carry the more
conservative values that we used to manually set in
blk_queue_make_request().

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-01-11 16:27:11 +01:00
NeilBrown 307729c8bc md/raid1: perform bad-block tests for WriteMostly devices too.
We normally try to avoid reading from write-mostly devices, but when
we do we really have to check for bad blocks and be sure not to
try reading them.

With the current code, best_good_sectors might not get set and that
causes zero-length read requests to be send down which is very
confusing.

This bug was introduced in commit d2eb35acfd and so the patch
is suitable for 3.1.x and 3.2.x

Reported-and-tested-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Reported-and-tested-by: Art -kwaak- van Breemen <ard@telegraafnet.nl>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel.org
2012-01-11 08:35:17 +11:00
NeilBrown f2a371c5e7 md: notify the 'degraded' sysfs attribute on failure.
We currently only 'notify' changes to the 'degraded' attribute
when it decreases, not when it increases.

Notifying on failure is a little awkward as it happen in
interrupt context.
So instead, notify when we remove the failed device from the array,
which is very soon afterwards.

Reported-and-tested-by: Mikhail Balabin <mbalabin@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-01-11 08:35:14 +11:00
Linus Torvalds 2943c83322 md update for 3.3
Big change is new hot-replacement.
 A slot in an array can hold 2 devices - one that
 wants-replacement and one that is the replacement.
 Once the replacement is built - either from the
 original or (in the case of errors) from elsewhere,
 the wants-replacement device will be removed.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIVAwUATwUdyDnsnt1WYoG5AQJBExAAuFQrzt7CN32nmiqaLlJ1snBC/ixOSJf7
 0X88YB+utO0qNIhiRTI1AulslRGms9pChyKmZZaxU2Hvzk1pTrFspsc8RlJTIv9S
 Si43due99Np08/Kf5rjYulyH0fcug0qIqlrCiBvc36pOIHZzam6IzrEvKFf0dqbe
 4vtLWuylEUiSEbN5gBordfTFZcxSPI/huMUgx6hEpdA4NtaPmN57/03d49k4RYgp
 f5l9htuIqdMgHwNJRUDGMooifuvILC5HXINUbsCFT6KUF/bxA75nW7W3C2BUq0V/
 CVb/ZoHFYtuaGBEcMUzN34k0DbHghPeSmlvXT4XWq7+gVNqGe33nbSJsx1oLXWr1
 m/b3j4Ublv+VVd7L1Rr40vTRB6wN5/2uN7SiD6d83ppPD0TuAY8YvMHgoLZQmQvh
 Ak4fEz07re+tueKhHwbi+1qMIw/ciQ9O/tI7r+AsVklkAxJNAfFKW6FOEFL2cjEW
 h4rbg1z9sU+xD8G01LBnJ0to/ajrJ4ch4wV/raLgi4+dJ4Tt3+/tas6WJlxHbyFF
 IiHCFM0+KcxLcwNkalZYg/zr5qu7EcwpPKLnc68m+LjXlYVWcnNHwv5WnOCHHQTw
 6yurGGlKqBfvsb8zJJGSRWqEtRSYjDlZiMt10/u+H40HiCiLX//vSvVbZoFiwWu1
 VzgEhzhvaYw=
 =j00/
 -----END PGP SIGNATURE-----

Merge tag 'md-3.3' of git://neil.brown.name/md

md update for 3.3

Big change is new hot-replacement.
A slot in an array can hold 2 devices - one that
wants-replacement and one that is the replacement.
Once the replacement is built - either from the
original or (in the case of errors) from elsewhere,
the wants-replacement device will be removed.

* tag 'md-3.3' of git://neil.brown.name/md: (36 commits)
  md/raid1: Mark device want_replacement when we see a write error.
  md/raid1: If there is a spare and a want_replacement device, start replacement.
  md/raid1: recognise replacements when assembling arrays.
  md/raid1: handle activation of replacement device when recovery completes.
  md/raid1: Allow a failed replacement device to be removed.
  md/raid1: Allocate spare to store replacement devices and their bios.
  md/raid1:  Replace use of mddev->raid_disks with conf->raid_disks.
  md/raid10: If there is a spare and a want_replacement device, start replacement.
  md/raid10: recognise replacements when assembling array.
  md/raid10: Allow replacement device to be replace old drive.
  md/raid10: handle recovery of replacement devices.
  md/raid10:  Handle replacement devices during resync.
  md/raid10: writes should get directed to replacement as well as original.
  md/raid10: allow removal of failed replacement devices.
  md/raid10: preferentially read from replacement device if possible.
  md/raid10:  change read_balance to return an rdev
  md/raid10: prepare data structures for handling replacement.
  md/raid5: Mark device want_replacement when we see a write error.
  md/raid5: If there is a spare and a want_replacement device, start replacement.
  md/raid5: recognise replacements when assembling array.
  ...
2012-01-08 13:28:33 -08:00
Al Viro ff01bb4832 fs: move code out of buffer.c
Move invalidate_bdev, block_sync_page into fs/block_dev.c.  Export
kill_bdev as well, so brd doesn't have to open code it.  Reduce
buffer_head.h requirement accordingly.

Removed a rather large comment from invalidate_bdev, as it looked a bit
obsolete to bother moving.  The small comment replacing it says enough.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:54:07 -05:00
NeilBrown 19d671695e md/raid1: Mark device want_replacement when we see a write error.
Now that WantReplacement drives are replaced cleanly, mark a drive
as want_replacement when we see a write error.  It might get failed soon so
the WantReplacement flag is irrelevant, but if the write error is recorded
in the bad block log, we still want to activate any spare that might
be available.

Signed-off-by:  NeilBrown <neilb@suse.de>
2011-12-23 10:17:57 +11:00
NeilBrown 7ef449d1ec md/raid1: If there is a spare and a want_replacement device, start replacement.
When attempting to add a spare to a RAID1 array, also consider
adding it as a replacement for a want_replacement device.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:57 +11:00
NeilBrown c19d57980b md/raid1: recognise replacements when assembling arrays.
If a Replacement is seen, file it as such.

If we see two replacements (or two normal devices) for the one slot,
abort.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:57 +11:00
NeilBrown 8c7a2c2bcf md/raid1: handle activation of replacement device when recovery completes.
When recovery completes ->spare_active is called.
This checks if the replacement is ready and if so it fails
the original.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:57 +11:00
NeilBrown b014f14c81 md/raid1: Allow a failed replacement device to be removed.
Replacement devices are stored at a different offset, so look
there too.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:56 +11:00
NeilBrown 8f19ccb2fd md/raid1: Allocate spare to store replacement devices and their bios.
In RAID1, a replacement is much like a normal device, so we just
double the size of the relevant arrays and look at all possible
devices for reads and writes.

This means that the array looks like it is now double the size in some
way - we need to be careful about that.
In particular, we checking if the array is still degraded while
creating a recovery request we need to only consider the first 'half'
- i.e. the real (non-replacement) devices.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:56 +11:00
NeilBrown 301946364e md/raid1: Replace use of mddev->raid_disks with conf->raid_disks.
In general mddev->raid_disks can change unexpectedly while
conf->raid_disks will only change in a very controlled way.  So change
some uses of one to the other.

The use of mddev->raid_disks will not cause actually problems but
this way is more consistent and safer in the long term.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:56 +11:00
NeilBrown b7044d41b5 md/raid10: If there is a spare and a want_replacement device, start replacement.
When attempting to add a spare to a RAID10 array, also consider
adding it as a replacement for a want_replacement device.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:56 +11:00
NeilBrown 56a2559bb6 md/raid10: recognise replacements when assembling array.
If a Replacement is seen, file it as such.

If we see two replacements (or two normal devices) for the one slot,
abort.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:55 +11:00
NeilBrown 4ca40c2ce0 md/raid10: Allow replacement device to be replace old drive.
When recovery finish and spare_active is called, check for a
replace that might have just become fully synced and mark it
as such, marking the original as failed.

Then when the original is removed, move the replacement into
its position.

This means that 'replacement' and spontaneously become NULL in some
situations.  Make sure we check for those.
It also means that 'rdev' and 'replacement' could appear to be
identical - check for that too.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:55 +11:00
NeilBrown 24afd80d99 md/raid10: handle recovery of replacement devices.
If there is a replacement device, then recover to it,
reading from any drives - maybe the one being replaced, maybe not.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:55 +11:00
NeilBrown 9ad1aefc8a md/raid10: Handle replacement devices during resync.
If we need to resync an array which has replacement devices,
we always write any block checked to every replacement.

If the resync was bitmap-based resync we will then complete the
replacement normally.
If it was a full resync, we mark the replacements as fully recovered
when the resync finishes so no further recovery is needed.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:55 +11:00
NeilBrown 475b0321a4 md/raid10: writes should get directed to replacement as well as original.
When writing, we need to submit two writes, one to the original,
and one to the replacements - if there is a replacement.

If the write to the replacement results in a write error we just
fail the device.  We only try to record write errors to the
original.

This only handles writing new data.  Writing for resync/recovery
will come later.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:55 +11:00
NeilBrown c8ab903ea9 md/raid10: allow removal of failed replacement devices.
Enhance raid10_remove_disk to be able to remove ->replacement
as well as ->rdev

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:54 +11:00
NeilBrown abbf098e6e md/raid10: preferentially read from replacement device if possible.
When reading (for array reads, not for recovery etc) we read from the
replacement device if it has recovered far enough.
This requires storing the chosen rdev in the 'r10_bio' so we can make
sure to drop the ref on the right device when the read finishes.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:54 +11:00
NeilBrown 96c3fd1f38 md/raid10: change read_balance to return an rdev
It makes more sense to return an rdev than just an index as
read_balance() gets a reference to the rdev and so returning
the pointer make this more idiomatic.

This will be needed in a future patch when we might return
a 'replacement' rdev instead of the main rdev.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:54 +11:00
NeilBrown 69335ef3bc md/raid10: prepare data structures for handling replacement.
Allow each slot in the RAID10 to have 2 devices, the want_replacement
and the replacement.

Also an r10bio to have 2 bios, and for resync/recovery allocate the
second bio if there are any replacement devices.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:54 +11:00
NeilBrown 3a6de2924a md/raid5: Mark device want_replacement when we see a write error.
Now that WantReplacement drives are replaced cleanly, mark a drive
as WantReplacement when we see a write error.  It might get failed soon so
the WantReplacement flag is irrelevant, but if the write error is recorded
in the bad block log, we still want to activate any spare that might
be available.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by:  NeilBrown <neilb@suse.de>
2011-12-23 10:17:54 +11:00
NeilBrown 7bfec5f35c md/raid5: If there is a spare and a want_replacement device, start replacement.
When attempting to add a spare to a RAID[456] array, also consider
adding it as a replacement for a want_replacement device.

This requires that common md code attempt hot_add even when the array
is not formally degraded.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:53 +11:00
NeilBrown 17045f52ac md/raid5: recognise replacements when assembling array.
If a Replacement is seen, file it as such.

If we see two replacements (or two normal devices) for the one slot,
abort.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:53 +11:00
NeilBrown dd054fce88 md/raid5: handle activation of replacement device when recovery completes.
When recovery completes - as reported by a call to ->spare_active,
we clear In_sync on the original and set it on the replacement.

Then when the original gets removed we move the replacement from
'replacement' to 'rdev'.

This could race with other code that is looking at these pointers,
so we use memory barriers and careful ordering to ensure that
a reader might see one device twice, but never no devices.
Then the readers guard against using both devices, which could
only happen when writing.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:53 +11:00
NeilBrown 9a3e1101b8 md/raid5: detect and handle replacements during recovery.
During recovery we want to write to the replacement but not
the original.  So we have two new flags
 - R5_NeedReplace if this stripe has a replacement that needs to
   be written at some stage
 - R5_WantReplace if NeedReplace, and the data is available, and
   a 'sync' has been requested on this stripe.

We also distinguish between 'sync and replace' which need to read
all other devices, and 'replace' which only needs to read the
devices being replaced.

Note that during resync we always write to any replacement device.
It might not need to be written to, but as we don't read to compare,
we have to write to be sure.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:53 +11:00
NeilBrown 977df36255 md/raid5: writes should get directed to replacement as well as original.
When writing, we need to submit two writes, one to the original, and
one to the replacement - if there is a replacement.

If the write to the replacement results in a write error, we just fail
the device.  We only try to record write errors to the original.

When writing for recovery, we shouldn't write to the original.  This
will be addressed in a subsequent patch that generally addresses
recovery.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:53 +11:00
NeilBrown 657e3e4d88 md/raid5: allow removal for failed replacement devices.
Enhance raid5_remove_disk to be able to remove ->replacement
as well as ->rdev.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:52 +11:00
NeilBrown 14a75d3e07 md/raid5: preferentially read from replacement device if possible.
If a replacement device is present and has been recovered far enough,
then use it for reading into the stripe cache.

If we get an error we don't try to repair it, we just fail the device.
A replacement device that gives errors does not sound sensible.

This requires removing the setting of R5_ReadError when we get
a read error during a read that bypasses the cache.  It was probably
a bad idea anyway as we don't know that every block in the read
caused an error, and it could cause ReadError to be set for the
replacement device, which is bad.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:52 +11:00
NeilBrown 995c4275a7 md/raid5: remove redundant bio initialisations.
We current initialise some fields of a bio when preparing a
stripe_head, and again just before submitting the request.

Remove the duplication by only setting the fields that lower level
devices don't touch in raid5_build_block, and only set the changeable
fields in ops_run_io.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:52 +11:00
NeilBrown ede7ee8b4d md/raid5: raid5.h cleanup
Remove some #defines that are no longer used, and replace some
others with an enum.
And remove an unused field.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:52 +11:00
NeilBrown 671488cc25 md/raid5: allow each slot to have an extra replacement device
Just enhance data structures to record a second device per slot to be
used as a 'replacement' device, replacing the original.
We also have a second bio in each slot in each stripe_head.  This will
only be used when writing to the array - we need to write to both the
original and the replacement at the same time, so will need two bios.

For now, only try using the replacement drive for aligned-reads.
In this case, we prefer the replacement if it has been recovered far
enough, otherwise use the original.

This includes a small enhancement.  Previously we would only do
aligned reads if the target device was fully recovered.  Now we also
do them if it has recovered far enough.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:52 +11:00
NeilBrown 2d78f8c451 md: create externally visible flags for supporting hot-replace.
hot-replace is a feature being added to md which will allow a
device to be replaced without removing it from the array first.

With hot-replace a spare can be activated and recovery can start while
the original device is still in place, thus allowing a transition from
an unreliable device to a reliable device without leaving the array
degraded during the transition.  It can also be use when the original
device is still reliable but it not wanted for some reason.

This will eventually be supported in RAID4/5/6 and RAID10.

This patch adds a super-block flag to distinguish the replacement
device.  If an old kernel sees this flag it will reject the device.

It also adds two per-device flags which are viewable and settable via
sysfs.
   "want_replacement" can be set to request that a device be replaced.
   "replacement" is set to show that this device is replacing another
   device.

The "rd%d" links in /sys/block/mdXx/md only apply to the original
device, not the replacement.  We currently don't make links for the
replacement - there doesn't seem to be a need.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:51 +11:00
NeilBrown b8321b68d1 md: change hot_remove_disk to take an rdev rather than a number.
Soon an array will be able to have multiple devices with the
same raid_disk number (an original and a replacement).  So removing
a device based on the number won't work.  So pass the actual device
handle instead.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:51 +11:00
NeilBrown 476a7abb9b md: remove test for duplicate device when setting slot number.
When setting the slot number on a device in an active array we
currently check that the number is not already in use.
We then call into the personality's hot_add_disk function
which performs the same test and returns the same error.

Thus the common test is not needed.

As we will shortly be changing some personalities to allow duplicates
in some cases (to support hot-replace), the common test will become
inconvenient.

So remove the common test.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:51 +11:00
NeilBrown 915c420ddf md/bitmap: be more consistent when setting new bits in memory bitmap.
For each active region corresponding to a bit in the bitmap with have
a 14bit counter (and some flags).
This counts
   number of active writes + bit in the on-disk bitmap + delay-needed.

The "delay-needed" is because we always want a delay before clearing a
bit.  So the number here is normally number of active writes plus 2.
If there have been no writes for a while, we drop to 1.
If still no writes we clear the bit and drop to 0.

So for consistency, when setting bit from the on-disk bitmap or by
request from user-space it is best to set the counter to '2' to start
with.

In particular we might also set the NEEDED_MASK flag at this time, and
in all other cases NEEDED_MASK is only set when the counter is 2 or
more.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:51 +11:00
NeilBrown 908f4fbd26 md/raid5: be more thorough in calculating 'degraded' value.
When an array is being reshaped to change the number of devices,
the two halves can be differently degraded.  e.g. one could be
missing a device and the other not.

So we need to be more careful about calculating the 'degraded'
attribute.

Instead of just inc/dec at appropriate times, perform a full
re-calculation examining both possible cases.  This doesn't happen
often so it not a big cost, and we already have most of the code to
do it.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:50 +11:00
NeilBrown 2e61ebbcc4 md/bitmap: daemon_work cleanup.
We have a variable 'mddev' in this function, but repeatedly get the
same value by dereferencing bitmap->mddev.
There is room for simplification here...

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:50 +11:00
NeilBrown 506c9e44a8 md: allow non-privileged uses to GET_*_INFO about raid arrays.
The info is already available in /proc/mdstat and /sys/block in
an accessible form so there is no point in putting a road-block in
the ioctl for information gathering.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:26 +11:00
NeilBrown 961902c0f8 md/bitmap: It is OK to clear bits during recovery.
commit d0a4bb4927 introduced a
regression which is annoying but fairly harmless.

When writing to an array that is undergoing recovery (a spare
in being integrated into the array), writing to the array will
set bits in the bitmap, but they will not be cleared when the
write completes.

For bits covering areas that have not been recovered yet this is not a
problem as the recovery will clear the bits.  However bits set in
already-recovered region will stay set and never be cleared.
This doesn't risk data integrity.  The only negatives are:
 - next time there is a crash, more resyncing than necessary will
   be done.
 - the bitmap doesn't look clean, which is confusing.

While an array is recovering we don't want to update the
'events_cleared' setting in the bitmap but we do still want to clear
bits that have very recently been set - providing they were written to
the recovering device.

So split those two needs - which previously both depended on 'success'
and always clear the bit of the write went to all devices.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 09:57:48 +11:00
NeilBrown 60fc13702a md: don't give up looking for spares on first failure-to-add
Before performing a recovery we try to remove any spares that
might not be working, then add any that might have become relevant.

Currently we abort on the first spare that cannot be added.
This is a false optimisation.
It is conceivable that - depending on rules in the personality - a
subsequent spare might be accepted.
Also the loop does other things like count the available spares and
reset the 'recovery_offset' value.

If we abort early these might not happen properly.

So remove the early abort.

In particular if you have an array what is undergoing recovery and
which has extra spares, then the recovery may not restart after as
reboot as the could of 'spares' might end up as zero.

Reported-by: Anssi Hannula <anssi.hannula@iki.fi>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 09:57:19 +11:00
NeilBrown 30d7a48368 md/raid5: ensure correct assessment of drives during degraded reshape.
While reshaping a degraded array (as when reshaping a RAID0 by first
converting it to a degraded RAID4) we currently get confused about
which devices are in_sync.  In most cases we get it right, but in the
region that is being reshaped we need to treat non-failed devices as
in-sync when we have the data but haven't actually written it out yet.

Reported-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 09:57:00 +11:00
NeilBrown 09cd9270ea md/linear: fix hot-add of devices to linear arrays.
commit d70ed2e4fa
broke hot-add to a linear array.
After that commit, metadata if not written to devices until they
have been fully integrated into the array as determined by
saved_raid_disk.  That patch arranged to clear that field after
a recovery completed.

However for linear arrays, there is no recovery - the integration is
instantaneous.  So we need to explicitly clear the saved_raid_disk
field.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 09:56:55 +11:00
Adam Kwolek 5d8c71f9e5 md: raid5 crash during degradation
NULL pointer access causes crash in raid5 module.

Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-09 14:26:11 +11:00
NeilBrown 9283d8c5af md/raid5: never wait for bad-block acks on failed device.
Once a device is failed we really want to completely ignore it.
It should go away soon anyway.

In particular the presence of bad blocks on it should not cause us to
block as we won't be trying to write there anyway.

So as soon as we can check if a device is Faulty, do so and pretend
that it is already gone if it is Faulty.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-08 16:27:57 +11:00
NeilBrown 8bd2f0a05b md: ensure new badblocks are handled promptly.
When we mark blocks as bad we need them to be acknowledged by the
metadata handler promptly.

For an in-kernel metadata handler that was already being done.  But
for an external metadata handler we need to alert it of the change by
sending a notification through the sysfs file.  This adds that
notification.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-08 16:26:08 +11:00
NeilBrown 52c64152a9 md: bad blocks shouldn't cause a Blocked status on a Faulty device.
Once a device is marked Faulty the badblocks - whether acknowledged or
not - become irrelevant.  So they shouldn't cause the device to be
marked as Blocked.

Without this patch, a process might write "-blocked" to clear the
Blocked status, but while that will correctly fail the device, it
won't remove the apparent 'blocked' status.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-08 16:22:48 +11:00
NeilBrown af8a24347f md: take a reference to mddev during sysfs access.
When we are accessing an mddev via sysfs we know that the
mddev cannot disappear because it has an embedded kobj which
is refcounted by sysfs.
And we also take the mddev_lock.
However this is not enough.

The final mddev_put could have been called and the
mddev_delayed_delete is waiting for sysfs to let go so it can destroy
the kobj and mddev.
In this state there are a lot of changes that should not be attempted.

To to guard against this we:
 - initialise mddev->all_mddevs in on last put so the state can be
   easily detected.
 - in md_attr_show and md_attr_store, check ->all_mddevs under
   all_mddevs_lock and mddev_get the mddev if it still appears to
   be active.

This means that if we get to sysfs as the mddev is being deleted we
will get -EBUSY.

rdev_attr_store and rdev_attr_show are similar but already have
sufficient protection.  They check that rdev->mddev still points to
mddev after taking mddev_lock.  As this is cleared  before delayed
removal which can only be requested under the mddev_lock, this
ensure the rdev and mddev are still alive.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-08 15:49:46 +11:00
NeilBrown 1d23f178d5 md: refine interpretation of "hold_active == UNTIL_IOCTL".
We like md devices to disappear when they really are not needed.
However it is not possible to tell from the current state whether it
is needed or not.  We can only tell from recent history of changes.

In particular immediately after we create an md device it looks very
similar to immediately after we have finished with it.

So we always preserve a newly created md device until something
significant happens.  This state is stored in 'hold_active'.

The normal case is to keep it until an ioctl happens, as that will
normally either activate it, or explicitly de-activate it.  If it
doesn't then it was probably created by mistake and it is now time to
get rid of it.

We can also modify an array via sysfs (instead of via ioctl) and we
currently treat any change via sysfs like an ioctl as a sign that if
it now isn't more active, it should be destroyed.
However this is not appropriate as changes made via sysfs are more
gradual so we should look for a more definitive change.

So this patch only clears 'hold_active' from UNTIL_IOCTL to clear when
the array_state is changed via sysfs.  Other changes via sysfs
are ignored.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-08 15:49:12 +11:00
NeilBrown 7c8f424798 md/lock: ensure updates to page_attrs are properly locked.
Page attributes are set using __set_bit rather than set_bit as
it normally called under a spinlock so the extra atomicity is not
needed.

However there are two places where we might set or clear page
attributes without holding the spinlock.
So add the spinlock in those cases.

This might be the cause of occasional reports that bits a aren't
getting clear properly - theory is that BITMAP_PAGE_PENDING gets lost
when BITMAP_PAGE_NEEDWRITE is set or cleared.  This is an
inconvenience, not a threat to data safety.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-11-23 10:18:52 +11:00
Dan Williams 257a4b42af md/raid5: STRIPE_ACTIVE has lock semantics, add barriers
All updates that occur under STRIPE_ACTIVE should be globally visible
when STRIPE_ACTIVE clears.  test_and_set_bit() implies a barrier, but
clear_bit() does not.

This is suitable for 3.1-stable.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
2011-11-08 16:22:06 +11:00
NeilBrown 9a3f530f39 md/raid5: abort any pending parity operations when array fails.
When the number of failed devices exceeds the allowed number
we must abort any active parity operations (checks or updates) as they
are no longer meaningful, and can lead to a BUG_ON in
handle_parity_checks6.

This bug was introduce by commit 6c0069c0ae
in 2.6.29.

Reported-by: Manish Katiyar <mkatiyar@gmail.com>
Tested-by: Manish Katiyar <mkatiyar@gmail.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
2011-11-08 16:22:01 +11:00
Stephen Rothwell a84450604d device-mapper: using EXPORT_SYBOL in dm-space-map-checker.c needs export.h
Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-07 10:29:10 -08:00
Stephen Rothwell 6f66263f8e device-mapper: dm-bufio.c needs to include module.h
since it uses the module facilities.

Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-07 10:29:10 -08:00
Paul Gortmaker 1944ce60fe drivers/md: change module.h -> export.h in persistent-data/dm-*
For the files which are not themselves modular, we can change
them to include only the smaller export.h since all they are
doing is looking for EXPORT_SYMBOL.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-07 10:29:09 -08:00
Linus Torvalds 32aaeffbd4 Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
  Revert "tracing: Include module.h in define_trace.h"
  irq: don't put module.h into irq.h for tracking irqgen modules.
  bluetooth: macroize two small inlines to avoid module.h
  ip_vs.h: fix implicit use of module_get/module_put from module.h
  nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
  include: replace linux/module.h with "struct module" wherever possible
  include: convert various register fcns to macros to avoid include chaining
  crypto.h: remove unused crypto_tfm_alg_modname() inline
  uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
  pm_runtime.h: explicitly requires notifier.h
  linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
  miscdevice.h: fix up implicit use of lists and types
  stop_machine.h: fix implicit use of smp.h for smp_processor_id
  of: fix implicit use of errno.h in include/linux/of.h
  of_platform.h: delete needless include <linux/module.h>
  acpi: remove module.h include from platform/aclinux.h
  miscdevice.h: delete unnecessary inclusion of module.h
  device_cgroup.h: delete needless include <linux/module.h>
  net: sch_generic remove redundant use of <linux/module.h>
  net: inet_timewait_sock doesnt need <linux/module.h>
  ...

Fix up trivial conflicts (other header files, and  removal of the ab3550 mfd driver) in
 - drivers/media/dvb/frontends/dibx000_common.c
 - drivers/media/video/{mt9m111.c,ov6650.c}
 - drivers/mfd/ab3550-core.c
 - include/linux/dmaengine.h
2011-11-06 19:44:47 -08:00
Linus Torvalds b4fdcb02f1 Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-block
* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
  block: don't call blk_drain_queue() if elevator is not up
  blk-throttle: use queue_is_locked() instead of lockdep_is_held()
  blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
  blk-throttle: Free up policy node associated with deleted rule
  block: warn if tag is greater than real_max_depth.
  block: make gendisk hold a reference to its queue
  blk-flush: move the queue kick into
  blk-flush: fix invalid BUG_ON in blk_insert_flush
  block: Remove the control of complete cpu from bio.
  block: fix a typo in the blk-cgroup.h file
  block: initialize the bounce pool if high memory may be added later
  block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
  block: drop @tsk from attempt_plug_merge() and explain sync rules
  block: make get_request[_wait]() fail if queue is dead
  block: reorganize throtl_get_tg() and blk_throtl_bio()
  block: reorganize queue draining
  block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
  block: pass around REQ_* flags instead of broken down booleans during request alloc/free
  block: move blk_throtl prototypes to block/blk.h
  block: fix genhd refcounting in blkio_policy_parse_and_set()
  ...

Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
and making the request functions be of type "void" instead of "int" in
 - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
 - drivers/staging/zram/zram_drv.c
2011-11-04 17:06:58 -07:00
Linus Torvalds 43672a0784 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/linux-dm
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/linux-dm:
  dm: raid fix device status indicator when array initializing
  dm log userspace: add log device dependency
  dm log userspace: fix comment hyphens
  dm: add thin provisioning target
  dm: add persistent data library
  dm: add bufio
  dm: export dm get md
  dm table: add immutable feature
  dm table: add always writeable feature
  dm table: add singleton feature
  dm kcopyd: add dm_kcopyd_zero to zero an area
  dm: remove superfluous smp_mb
  dm: use local printk ratelimit
  dm table: propagate non rotational flag
2011-11-02 17:02:37 -07:00
Paul Gortmaker daaa5f7cbe md: Add in export.h for files using EXPORT_SYMBOL
These files were getting the defines for EXPORT_SYMBOL because
device.h was including module.h.  But we are going to put an
end to that.  So add the proper export.h include now.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 19:31:19 -04:00
Paul Gortmaker 056075c764 md: Add module.h to all files using it implicitly
A pending cleanup will mean that module.h won't be implicitly
everywhere anymore.  Make sure the modular drivers in md dir
are actually calling out for <module.h> explicitly in advance.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 19:31:18 -04:00
Linus Torvalds 571109f536 Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md:
  md/raid10:  Fix bug when activating a hot-spare.
2011-10-31 15:21:29 -07:00
Jonathan E Brassow 2e727c3ca1 dm: raid fix device status indicator when array initializing
When devices in a RAID array are not in-sync, they are supposed to be
reported as such in the status output as an 'a' character, which means
"alive, but not in-sync".  But when the entire array is rebuilt 'A' is
being used, which is incorrect.  This patch corrects this to 'a'.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:21:26 +00:00
Jonathan E Brassow 5a25f0eb70 dm log userspace: add log device dependency
Allow userspace dm log implementations to register their log device so it
is no longer missing from the list of device dependencies.

When device mapper targets use a device they normally call dm_get_device
which includes it in the device list returned to userspace applications
such as LVM through the DM_TABLE_DEPS ioctl.  Userspace log devices
don't use dm_get_device as userspace opens them so they are missing from
the list of dependencies.

This patch extends the DM_ULOG_CTR operation to allow userspace to
respond with the name of the log device (if appropriate) to be
registered via 'dm_get_device'.  DM_ULOG_REQUEST_VERSION is incremented.

This is backwards compatible.  If the kernel and userspace log server
have both been updated, the new information will be passed down to the
kernel and the device will be registered.  If the kernel is new, but
the log server is old, the log server will not pass down any device
information and the kernel will simply bypass the device registration
as before.  If the kernel is old but the log server is new, the log
server will see the old version number and not pass the device info.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:21:24 +00:00
Jonathan Brassow b89544575d dm log userspace: fix comment hyphens
Fix comments: clustered-disk needs a hyphen not an underscore.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:21:22 +00:00
Joe Thornber 991d9fa02d dm: add thin provisioning target
Initial EXPERIMENTAL implementation of device-mapper thin provisioning
with snapshot support.  The 'thin' target is used to create instances of
the virtual devices that are hosted in the 'thin-pool' target.  The
thin-pool target provides data sharing among devices.  This sharing is
made possible using the persistent-data library in the previous patch.

The main highlight of this implementation, compared to the previous
implementation of snapshots, is that it allows many virtual devices to
be stored on the same data volume, simplifying administration and
allowing sharing of data between volumes (thus reducing disk usage).

Another big feature is support for arbitrary depth of recursive
snapshots (snapshots of snapshots of snapshots ...).  The previous
implementation of snapshots did this by chaining together lookup tables,
and so performance was O(depth).  This new implementation uses a single
data structure so we don't get this degradation with depth.

For further information and examples of how to use this, please read
Documentation/device-mapper/thin-provisioning.txt

Signed-off-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:21:18 +00:00
Joe Thornber 3241b1d3e0 dm: add persistent data library
The persistent-data library offers a re-usable framework for the storage
and management of on-disk metadata in device-mapper targets.

It's used by the thin-provisioning target in the next patch and in an
upcoming hierarchical storage target.

For further information, please read
Documentation/device-mapper/persistent-data.txt

Signed-off-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:19:11 +00:00
Mikulas Patocka 95d402f057 dm: add bufio
The dm-bufio interface allows you to do cached I/O on devices,
holding recently-read blocks in memory and performing delayed writes.

We don't use buffer cache or page cache already present in the kernel, because:
* we need to handle block sizes larger than a page
* we can't allocate memory to perform reads or we'd have deadlocks

Currently, when a cache is required, we limit its size to a fraction of
available memory.  Usage can be viewed and changed in
/sys/module/dm_bufio/parameters/ .

The first user is thin provisioning, but more dm users are planned.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:19:09 +00:00
Alasdair G Kergon 3cf2e4ba74 dm: export dm get md
Export dm_get_md() for the new thin provisioning target to use.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:19:06 +00:00
Alasdair G Kergon 36a0456fbf dm table: add immutable feature
Introduce DM_TARGET_IMMUTABLE to indicate that the target type cannot be mixed
with any other target type, and once loaded into a device, it cannot be
replaced with a table containing a different type.

The thin provisioning pool device will use this.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:19:04 +00:00
Alasdair G Kergon cc6cbe141a dm table: add always writeable feature
Add a target feature flag DM_TARGET_ALWAYS_WRITEABLE to indicate that a target
does not support read-only mode.

The initial implementation of the thin provisioning target uses this.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:19:02 +00:00
Alasdair G Kergon 3791e2fc0e dm table: add singleton feature
Introduce the concept of a singleton table which contains exactly one target.

If a target type sets the DM_TARGET_SINGLETON feature bit device-mapper
will ensure that any table that includes that target contains no others.

The thin provisioning pool target uses this.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:19:00 +00:00
Mikulas Patocka 7f06965390 dm kcopyd: add dm_kcopyd_zero to zero an area
This patch introduces dm_kcopyd_zero() to make it easy to use
kcopyd to write zeros into the requested areas instead
instead of copying.  It is implemented by passing a NULL
copying source to dm_kcopyd_copy().

The forthcoming thin provisioning target uses this.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:18:58 +00:00
Namhyung Kim fbdc86f3bd dm: remove superfluous smp_mb
Since set_current_state() contains a memory barrier in it,
an additional barrier isn't needed.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:18:56 +00:00
Namhyung Kim 71a16736a1 dm: use local printk ratelimit
printk_ratelimit() shares global ratelimiting state with all
other subsystems, so its usage is discouraged. Instead,
define and use dm's local state.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:18:54 +00:00
Mandeep Singh Baines 4693c9668f dm table: propagate non rotational flag
Allow QUEUE_FLAG_NONROT to propagate up the device stack if all
underlying devices are non-rotational.  Tools like ureadahead will
schedule IOs differently based on the rotational flag.

With this patch, I see boot time go from 7.75 s to 7.46 s on my device.

Suggested-by: J. Richard Barnette <jrbarnette@chromium.org>
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: dm-devel@redhat.com
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:18:50 +00:00
NeilBrown 7fcc7c8acf md/raid10: Fix bug when activating a hot-spare.
This is a fairly serious bug in RAID10.

When a RAID10 array is degraded and a hot-spare is activated, the
spare does not take up the empty slot, but rather replaces the first
working device.
This is likely to make the array non-functional.   It would normally
be possible to recover the data, but that would need care and is not
guaranteed.

This bug was introduced in commit
   2bb77736ae
which first appeared in 3.1.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-31 12:59:44 +11:00
Linus Torvalds c3ae1f3356 Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md: (34 commits)
  md: Fix some bugs in recovery_disabled handling.
  md/raid5: fix bug that could result in reads from a failed device.
  lib/raid6: Fix filename emitted in generated code
  md.c: trivial comment fix
  MD: Allow restarting an interrupted incremental recovery.
  md: clear In_sync bit on devices added to an active array.
  md: add proper write-congestion reporting to RAID1 and RAID10.
  md: rename "mdk_personality" to "md_personality"
  md/bitmap remove fault injection options.
  md/raid5: typedef removal: raid5_conf_t -> struct r5conf
  md/raid1: typedef removal: conf_t -> struct r1conf
  md/raid10: typedef removal: conf_t -> struct r10conf
  md/raid0: typedef removal: raid0_conf_t -> struct r0conf
  md/multipath: typedef removal: multipath_conf_t -> struct mpconf
  md/linear: typedef removal: linear_conf_t -> struct linear_conf
  md/faulty: remove typedef: conf_t -> struct faulty_conf
  md/linear: remove typedefs: dev_info_t -> struct dev_info
  md: remove typedefs: mirror_info_t -> struct mirror_info
  md: remove typedefs: r10bio_t -> struct r10bio and r1bio_t -> struct r1bio
  md: remove typedefs: mdk_thread_t -> struct md_thread
  ...
2011-10-26 21:39:42 +02:00
NeilBrown d890fa2b05 md: Fix some bugs in recovery_disabled handling.
In 3.0 we changed the way recovery_disabled was handle so that instead
of testing against zero, we test an mddev-> value against a conf->
value.
Two problems:
  1/ one place in raid1 was missed and still sets to '1'.
  2/ We didn't explicitly set the conf-> value at array creation
     time.
     It defaulted to '0' just like the mddev value does so they
     could appear equal and thus disable recovery.
     This did not affect normal 'md' as it calls bind_rdev_to_array
     which changes the mddev value.  However the dmraid interface
     doesn't call this and so doesn't change ->recovery_disabled; so at
     array start all recovery is incorrectly disabled.

So initialise the 'conf' value to one less that the mddev value, so
the will only be the same when explicitly set that way.

Reported-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown  <neilb@suse.de>
2011-10-26 11:54:39 +11:00
NeilBrown 355840e7a7 md/raid5: fix bug that could result in reads from a failed device.
This bug was introduced in 415e72d034
which was in 2.6.36.

There is a small window of time between when a device fails and when
it is removed from the array.  During this time we might still read
from it, but we won't write to it - so it is possible that we could
read stale data.

We didn't need the test of 'Faulty' before because the test on
In_sync is sufficient.  Since we started allowing reads from the early
part of non-In_sync devices we need a test on Faulty too.

This is suitable for any kernel from 2.6.36 onwards, though the patch
might need a bit of tweaking in 3.0 and earlier.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-26 10:31:04 +11:00
Tao Ma 9562ad9ab3 block: Remove the control of complete cpu from bio.
bio originally has the functionality to set the complete cpu, but
it is broken.

Chirstoph said that "This code is unused, and from the all the
discussions lately pretty obviously broken.  The only thing keeping
it serves is creating more confusion and possibly more bugs."

And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine
with leaving cpu control to the request based drivers, they are the
only ones that can toggle the setting anyway".

So this patch tries to remove all the work of controling complete cpu
from a bio.

Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-24 16:11:30 +02:00
Alasdair G Kergon d136f2efdf dm kcopyd: fix job_pool leak
Fix memory leak introduced by commit a6e50b409d
(dm snapshot: skip reading origin when overwriting complete chunk).

When allocating a set of jobs from kc->job_pool, job->master_job must be
set (to point to itself) so that the mempool item gets freed when the
master_job completes.

master_job was introduced by commit c6ea41fbbe
(dm kcopyd: preallocate sub jobs to avoid deadlock)

Reported-by: Michael Leun <ml@newton.leun.net>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-23 20:55:17 +01:00
Jens Axboe 5c04b426f2 Merge branch 'v3.1-rc10' into for-3.2/core
Conflicts:
	block/blk-core.c
	include/linux/blkdev.h

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19 14:30:42 +02:00
Chris Dunlop 751e67ca2e md.c: trivial comment fix
Trivial comment fix

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-19 17:15:15 +11:00
Andrei Warkentin d70ed2e4fa MD: Allow restarting an interrupted incremental recovery.
If an incremental recovery was interrupted, a subsequent
re-add will result in a full recovery, even though an
incremental should be possible (seen with raid1).

Solve this problem by not updating the superblock on the
recovering device until array is not degraded any longer.

Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrei Warkentin <andreiw@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-18 12:16:48 +11:00
NeilBrown d30519fc59 md: clear In_sync bit on devices added to an active array.
When we add a device to an active array it can be meaningful to set
the 'insync' flag.  This indicates that the device is in-sync with the
array except for locations recorded in the bitmap.
A bitmap-based recovery can then bring it completely in-sync.

Internally we move that flag to 'saved_raid_disk' but forgot to clear
In_sync like we do in add_new_disk.

So clear In_sync after moving its value to saved_raid_disk.

Reported-by: Andrei Warkentin <andreiw@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-18 12:13:47 +11:00
NeilBrown 34db0cd60f md: add proper write-congestion reporting to RAID1 and RAID10.
RAID1 and RAID10 handle write requests by queuing them for handling by
a separate thread.  This is because when a write-intent-bitmap is
active we might need to update the bitmap first, so it is good to
queue a lot of writes, then do one big bitmap update for them all.

However writeback request devices to appear to be congested after a
while so it can make some guesstimate of throughput.  The infinite
queue defeats that (note that RAID5 has already has a finite queue so
it doesn't suffer from this problem).

So impose a limit on the number of pending write requests.  By default
it is 1024 which seems to be generally suitable.  Make it configurable
via module option just in case someone finds a regression.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:50:01 +11:00
NeilBrown 84fc4b56db md: rename "mdk_personality" to "md_personality"
"mdk" doesn't mean anything any more.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:49:58 +11:00
NeilBrown 29d3247ea2 md/bitmap remove fault injection options.
These are too hard to use to be much more than noise.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:49:56 +11:00
NeilBrown d1688a6d55 md/raid5: typedef removal: raid5_conf_t -> struct r5conf
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:49:52 +11:00
NeilBrown e809636047 md/raid1: typedef removal: conf_t -> struct r1conf
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:49:05 +11:00
NeilBrown e879a8793f md/raid10: typedef removal: conf_t -> struct r10conf
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:49:02 +11:00
NeilBrown e373ab1091 md/raid0: typedef removal: raid0_conf_t -> struct r0conf
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:48:59 +11:00
NeilBrown 69724e28ca md/multipath: typedef removal: multipath_conf_t -> struct mpconf
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:48:57 +11:00
NeilBrown e849b9381f md/linear: typedef removal: linear_conf_t -> struct linear_conf
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:48:54 +11:00
NeilBrown 8f1ae43dd2 md/faulty: remove typedef: conf_t -> struct faulty_conf
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:48:52 +11:00
NeilBrown a71207713a md/linear: remove typedefs: dev_info_t -> struct dev_info
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:48:49 +11:00
NeilBrown 0f6d02d580 md: remove typedefs: mirror_info_t -> struct mirror_info
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:48:46 +11:00
NeilBrown 9f2c9d12bc md: remove typedefs: r10bio_t -> struct r10bio and r1bio_t -> struct r1bio
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:48:43 +11:00
NeilBrown 2b8bf3451d md: remove typedefs: mdk_thread_t -> struct md_thread
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:48:23 +11:00
NeilBrown fd01b88c75 md: remove typedefs: mddev_t -> struct mddev
Having mddev_t and 'struct mddev_s' is ugly and not preferred

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:47:53 +11:00
NeilBrown 3cb0300200 md: removing typedefs: mdk_rdev_t -> struct md_rdev
The typedefs are just annoying. 'mdk' probably refers to 'md_k.h'
which used to be an include file that defined this thing.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:45:26 +11:00
NeilBrown 50de8df4ab md/raid0: convert some printks to pr_debug.
When md assembles a RAID0 array it prints out lots of info which
is really just for debugging, so convert that to pr_debug.
It also prints out the resulting configuration which could be
interesting, so keep that as 'printk' but tidy it up a bit.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-07 14:23:22 +11:00
NeilBrown 36a4e1fe0f md: remove PRINTK and dprintk debugging and use pr_debug
Being able to dynamically enable these make them much more useful.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-07 14:23:17 +11:00
NeilBrown bdc04e6b15 md: remove some old DEBUGging code.
This code is not really helpful and is hard to maintain, so just
discard it.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-07 14:23:04 +11:00
NeilBrown db298e1946 md/raid5: convert to macros into inline functions.
More type-safety.  Easier to read.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-07 14:23:00 +11:00
NeilBrown 0fc280f606 md/raid1/ avoid bio search in end_sync_read()
We know which device we just read from so we don't need to
search the bios to find out.  Just use ->read_disk.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-07 14:22:55 +11:00
Namhyung Kim ba3ae3bee3 md/raid1: factor out common bio handling code
When normal-write and sync-read/write bio completes, we should
find out the disk number the bio belongs to. Factor those common
code out to a separate function.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-07 14:22:53 +11:00
NeilBrown e4f869d9de md/raid5: remove pointless NULL test.
In the 'abort' branch of run(), 'conf' cannot possibly be NULL,
so remove the test.

Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-07 14:22:49 +11:00
NeilBrown ce550c2059 md/raid1: add documentation to r1_private_data_s data structure.
There wasn't much and it is inconsistent.
Also rearrange fields to keep related fields together.

Reported-by: Aapo Laine <aapo.laine@shiftmail.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-07 14:22:33 +11:00
Linus Torvalds 6367f1775e Merge branch 'for-linus' of http://people.redhat.com/agk/git/linux-dm
* 'for-linus' of http://people.redhat.com/agk/git/linux-dm:
  dm crypt: always disable discard_zeroes_data
  dm: raid fix write_mostly arg validation
  dm table: avoid crash if integrity profile changes
  dm: flakey fix corrupt_bio_byte error path
2011-10-06 08:31:47 -07:00
Milan Broz 983c7db347 dm crypt: always disable discard_zeroes_data
If optional discard support in dm-crypt is enabled, discards requests
bypass the crypt queue and blocks of the underlying device are discarded.
For the read path, discarded blocks are handled the same as normal
ciphertext blocks, thus decrypted.

So if the underlying device announces discarded regions return zeroes,
dm-crypt must disable this flag because after decryption there is just
random noise instead of zeroes.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-09-25 23:26:21 +01:00
Jonthan Brassow 8232480944 dm: raid fix write_mostly arg validation
Fix off-by-one error in validation of write_mostly.

The user-supplied value given for the 'write_mostly' argument must be an
index starting at 0.  The validation of the supplied argument failed to
check for 'N' ('>' vs '>='), which would have caused an access beyond the
end of the array.

Reported-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-09-25 23:26:19 +01:00
Mike Snitzer 876fbba1db dm table: avoid crash if integrity profile changes
Commit a63a5cf (dm: improve block integrity support) introduced a
two-phase initialization of a DM device's integrity profile.  This
patch avoids dereferencing a NULL 'template_disk' pointer in
blk_integrity_register() if there is an integrity profile mismatch in
dm_table_set_integrity().

This can occur if the integrity profiles for stacked devices in a DM
table are changed between the call to dm_table_prealloc_integrity() and
dm_table_set_integrity().

Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org # 2.6.39
2011-09-25 23:26:17 +01:00
Mike Snitzer 68e58a294f dm: flakey fix corrupt_bio_byte error path
If no arguments were provided to the corrupt_bio_byte feature an error
should be returned immediately.

Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-09-25 23:26:15 +01:00
Daniel P. Berrange 2dba6a911c md: don't delay reboot by 1 second if no MD devices exist
The md_notify_reboot() method includes a call to mdelay(1000),
to deal with "exotic SCSI devices" which are too volatile on
reboot. The delay is unconditional. Even if the machine does
not have any block devices, let alone MD devices, the kernel
shutdown sequence is slowed down.

1 second does not matter much with physical hardware, but with
certain virtualization use cases any wasted time in the bootup
& shutdown sequence counts for alot.

* drivers/md/md.c: md_notify_reboot() - only impose a delay if
  there was at least one MD device to be stopped during reboot

Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-09-23 19:54:04 +10:00
Wang Sheng-Hui 7e84152626 trival: md_k.h should be md.h in the beginning comment of file md.h
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-09-21 15:37:46 +10:00
NeilBrown 2585f3ef8c md/bitmap: improve handling of 'allclean'.
The 'allclean' flag is used to cache the fact that there is nothing to
do, so we can avoid waking up and scanning the bitmap regularly.

The two sorts of pages that might need the attention of the bitmap
daemon are BITMAP_PAGE_PENDING and BITMAP_PAGE_NEEDWRITE pages.

So make sure allclean reflects exactly when there are none of those.
So:
  set it before scanning all pages with either bit set.
  clear it whenever these bits are set
  clear it when we desire not to clear one of these bits.
  don't clear it any other time.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-09-21 15:37:46 +10:00
NeilBrown 5a537df44d md/bitmap: rename and tidy up BITMAP_PAGE_CLEAN
The flag 'BITMAP_PAGE_CLEAN' has a confusing name as it doesn't mean
that the page is clean, but rather that there are counters in the page
which allow bits in the bitmap to be cleared - i.e. maybe cleaning can
happen.

So change it to BITMAP_PAGE_PENDING and fix some irregularities:
 - Don't set it in bitmap_init_from_disk as bitmap_set_memory_bits
   sets it when needed
 - in bitmap_daemon_work, if we find a counter that is '1', but
   need_sync is set, then set BITMAP_PAGE_PENDING again (it was
   recently cleared) to ensure we don't forget about this bit.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-09-21 15:37:46 +10:00
NeilBrown 01f96c0a99 md: Avoid waking up a thread after it has been freed.
Two related problems:

1/ some error paths call "md_unregister_thread(mddev->thread)"
   without subsequently clearing ->thread.  A subsequent call
   to mddev_unlock will try to wake the thread, and crash.

2/ Most calls to md_wakeup_thread are protected against the thread
   disappeared either by:
      - holding the ->mutex
      - having an active request, so something else must be keeping
        the array active.
   However mddev_unlock calls md_wakeup_thread after dropping the
   mutex and without any certainty of an active request, so the
   ->thread could theoretically disappear.
   So we need a spinlock to provide some protections.

So change md_unregister_thread to take a pointer to the thread
pointer, and ensure that it always does the required locking, and
clears the pointer properly.

Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>
cc: stable@kernel.org
2011-09-21 15:30:20 +10:00
Christoph Hellwig 5a7bbad27a block: remove support for bio remapping from ->make_request
There is very little benefit in allowing to let a ->make_request
instance update the bios device and sector and loop around it in
__generic_make_request when we can archive the same through calling
generic_make_request from the driver and letting the loop in
generic_make_request handle it.

Note that various drivers got the return value from ->make_request and
returned non-zero values for errors.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-09-12 12:12:01 +02:00
Jens Axboe c20e8de27f block: rename __make_request() to blk_queue_bio()
Now that it's exported, lets put it in a more sane namespace.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-09-12 12:08:31 +02:00
Christoph Hellwig 166e1f901b block: export __make_request
Avoid the hacks need for request based device mappers currently by simply
exporting the symbol instead of trying to get it through the back door.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-09-12 12:08:27 +02:00
NeilBrown 27a7b260f7 md: Fix handling for devices from 2TB to 4TB in 0.90 metadata.
0.90 metadata uses an unsigned 32bit number to count the number of
kilobytes used from each device.
This should allow up to 4TB per device.
However we multiply this by 2 (to get sectors) before casting to a
larger type, so sizes above 2TB get truncated.

Also we allow rdev->sectors to be larger than 4TB, so it is possible
for the array to be resized larger than the metadata can handle.
So make sure rdev->sectors never exceeds 4TB when 0.90 metadata is in
used.

Also the sanity check at the end of super_90_load should include level
1 as it used ->size too. (RAID0 and Linear don't use ->size at all).

Reported-by: Pim Zandbergen <P.Zandbergen@macroscoop.nl>
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2011-09-10 17:21:28 +10:00
NeilBrown 079fa166a2 md/raid1,10: Remove use-after-free bug in make_request.
A single request to RAID1 or RAID10 might result in multiple
requests if there are known bad blocks that need to be avoided.

To detect if we need to submit another write request we test:
 	if (sectors_handled < (bio->bi_size >> 9)) {

However this is after we call **_write_done() so the 'bio' no longer
belongs to us - the writes could have completed and the bio freed.

So move the **_write_done call until after the test against
bio->bi_size.

This addresses https://bugzilla.kernel.org/show_bug.cgi?id=41862

Reported-by: Bruno Wolff III <bruno@wolff.to>
Tested-by: Bruno Wolff III <bruno@wolff.to>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-09-10 17:21:23 +10:00
NeilBrown 19d5f834d6 md/raid10: unify handling of write completion.
A write can complete at two different places:
1/ when the last member-device write completes, through
   raid10_end_write_request
2/ in make_request() when we remove the initial bias from ->remaining.

These two should do exactly the same thing and the comment says they
do, but they don't.

So factor the correct code out into a function and call it in both
places.  This makes the code much more similar to RAID1.

The difference is only significant if there is an error, and they
usually take a while, so it is unlikely that there will be an error
already when make_request is completing, so this is unlikely to cause
real problems.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-09-10 17:21:17 +10:00
NeilBrown 43220aa0f2 md/raid5: fix a hang on device failure.
Waiting for a 'blocked' rdev to become unblocked in the raid5d thread
cannot work with internal metadata as it is the raid5d thread which
will clear the blocked flag.
This wasn't a problem in 3.0 and earlier as we only set the blocked
flag when external metadata was used then.
However we now set it always, so we need to be more careful.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-08-31 12:49:14 +10:00
NeilBrown 7da64a0abc md: fix clearing of 'blocked' flag in the presence of bad blocks.
When the 'blocked' flag on a device is cleared while there are
unacknowledged bad blocks we must fail the device.  This is needed for
backwards compatability of the interface.

The code currently uses the wrong test for "unacknowledged bad blocks
exist".  Change it to the right test.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-08-30 16:20:17 +10:00
NeilBrown 1b6afa1758 md/linear: avoid corrupting structure while waiting for rcu_free to complete.
I don't know what I was thinking putting 'rcu' after a dynamically
sized array!  The array could still be in use when we call rcu_free()
(That is the point) so we mustn't corrupt it.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2011-08-25 14:43:53 +10:00
Namhyung Kim a5bf4df0c8 md: use REQ_NOIDLE flag in md_super_write()
Queue idling is used for the anticipation of immediate
sequencial I/O's but md_super_write() is a kind of one-
shot operation, coupled with md_super_wait(), so the
idling in this case will be just a waste of time.

Specifying REQ_NOIDLE prevents it. Instead of adding
the flag to submit_bio() directly, use pre-defined
macro WRITE_FLUSH_FUA.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-08-25 14:43:34 +10:00
NeilBrown aeb9b21184 md: ensure changes to 'write-mostly' are reflected in metadata.
The 'write-mostly' flag can be changed through sysfs.
With 0.90 metadata, those changes are reflected in the metadata.
For 1.x metadata, they aren't.

So fix super_1_sync to record 'write-mostly' status.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-08-25 14:43:08 +10:00
NeilBrown 5ef56c8fec md: report failure if a 'set faulty' request doesn't.
Sometimes a device will refuse to be set faulty.  e.g. RAID1 will
never let the last working device become faulty.

So check if "md_error()" did manage to set the faulty flag and fail
with EBUSY if it didn't.

Resolves-Debian-Bug: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=601198
Reported-by: Mike Hommey <mh+reportbug@glandium.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-08-25 14:42:51 +10:00
Mike Snitzer ed8b752bcc dm table: set flush capability based on underlying devices
DM has always advertised both REQ_FLUSH and REQ_FUA flush capabilities
regardless of whether or not a given DM device's underlying devices
also advertised a need for them.

Block's flush-merge changes from 2.6.39 have proven to be more costly
for DM devices.  Performance regressions have been reported even when
DM's underlying devices do not advertise that they have a write cache.

Fix the performance regressions by configuring a DM device's flushing
capabilities based on those of the underlying devices' capabilities.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:08 +01:00
Milan Broz 772ae5f54d dm crypt: optionally support discard requests
Add optional parameter field to dmcrypt table and support
"allow_discards" option.

Discard requests bypass crypt queue processing. Bio is simple remapped
to underlying device.

Note that discard will be never enabled by default because of security
consequences.  It is up to the administrator to enable it for encrypted
devices.

(Note that userspace cryptsetup does not understand new optional
parameters yet.  Support for this will come later.  Until then, you
should use 'dmsetup' to enable and disable this.)

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:08 +01:00
Jonathan Brassow 327372797c dm raid: add md raid1 support
Support the MD RAID1 personality through dm-raid.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:07 +01:00
Jonathan Brassow b12d437b73 dm raid: support metadata devices
Add the ability to parse and use metadata devices to dm-raid.  Although
not strictly required, without the metadata devices, many features of
RAID are unavailable.  They are used to store a superblock and bitmap.

The role, or position in the array, of each device must be recorded in
its superblock.  This is to help with fault handling, array reshaping,
and sanity checks.  RAID 4/5/6 devices must be loaded in a specific order:
in this way, the 'array_position' field helps validate the correctness
of the mapping when it is loaded.  It can be used during reshaping to
identify which devices are added/removed.  Fault handling is impossible
without this field.  For example, when a device fails it is recorded in
the superblock.  If this is a RAID1 device and the offending device is
removed from the array, there must be a way during subsequent array
assembly to determine that the failed device was the one removed.  This
is done by correlating the 'array_position' field and the bit-field
variable 'failed_devices'.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:07 +01:00
Jonathan Brassow 46bed2b5c1 dm raid: add write_mostly parameter
Add the write_mostly parameter to RAID1 dm-raid tables.

This allows the user to set the WriteMostly flag on a RAID1 device that
should normally be avoided for read I/O.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:07 +01:00
Jonathan Brassow c1084561bb dm raid: add region_size parameter
Allow the user to specify the region_size.

Ensures that the supplied value meets md's constraints, viz. the number of
regions does not exceed 2^21.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:07 +01:00
Mikulas Patocka 759dea204c dm ioctl: forbid multiple device specifiers
Exactly one of name, uuid or device must be specified when referencing
an existing device.  This removes the ambiguity (risking the wrong
device being updated) if two conflicting parameters were specified.
Previously one parameter got used and any others were ignored silently.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:06 +01:00
Mikulas Patocka ba2e19b0f4 dm ioctl: introduce __get_dev_cell
Move logic to find device based on major/minor number to a separate
function __get_dev_cell (similar to __get_uuid_cell and __get_name_cell).
This makes the function __find_device_hash_cell more straightforward.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:06 +01:00
Mikulas Patocka 0ddf9644cc dm ioctl: fill in device parameters in more ioctls
Move parameter filling from find_device to __find_device_hash_cell.

This patch causes ioctls using __find_device_hash_cell
(DM_DEV_REMOVE_CMD, DM_DEV_SUSPEND_CMD - resume, DM_TABLE_CLEAR_CMD)
to return device parameters, bringing them into line with the other
ioctls.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:06 +01:00
Mike Snitzer a3998799fb dm flakey: add corrupt_bio_byte feature
Add corrupt_bio_byte feature to simulate corruption by overwriting a byte at a
specified position with a specified value during intervals when the device is
"down".

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:06 +01:00
Mike Snitzer b26f5e3d71 dm flakey: add drop_writes
Add 'drop_writes' option to drop writes silently while the
device is 'down'.  Reads are not touched.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:05 +01:00
Mike Snitzer dfd068b01f dm flakey: support feature args
Add the ability to specify arbitrary feature flags when creating a
flakey target.  This code uses the same target argument helpers that
the multipath target does.

Also remove the superfluous 'dm-flakey' prefixes from the error messages,
as they already contain the prefix 'flakey'.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:05 +01:00
Mike Snitzer 30e4171bfe dm flakey: use dm_target_offset and support discards
Use dm_target_offset() and support discards.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:05 +01:00
Mike Snitzer 498f0103ea dm table: share target argument parsing functions
Move multipath target argument parsing code into dm-table so other
targets can share it.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:04 +01:00
Mikulas Patocka a6e50b409d dm snapshot: skip reading origin when overwriting complete chunk
If we write a full chunk in the snapshot, skip reading the origin device
because the whole chunk will be overwritten anyway.

This patch changes the snapshot write logic when a full chunk is written.
In this case:
  1. allocate the exception
  2. dispatch the bio (but don't report the bio completion to device mapper)
  3. write the exception record
  4. report bio completed

Callbacks must be done through the kcopyd thread, because callbacks must not
race with each other.  So we create two new functions:

  dm_kcopyd_prepare_callback: allocate a job structure and prepare the callback.
  (This function must not be called from interrupt context.)

  dm_kcopyd_do_callback: submit callback.
  (This function may be called from interrupt context.)

Performance test (on snapshots with 4k chunk size):
  without the patch:
    non-direct-io sequential write (dd):    17.7MB/s
    direct-io sequential write (dd):        20.9MB/s
    non-direct-io random write (mkfs.ext2): 0.44s

  with the patch:
    non-direct-io sequential write (dd):    26.5MB/s
    direct-io sequential write (dd):        33.2MB/s
    non-direct-io random write (mkfs.ext2): 0.27s

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:04 +01:00
Mikulas Patocka d5b9dd04bd dm: ignore merge_bvec for snapshots when safe
Add a new flag DMF_MERGE_IS_OPTIONAL to struct mapped_device to indicate
whether the device can accept bios larger than the size its merge
function returns.  When set, use this to send large bios to snapshots
which can split them if necessary.  Snapshot I/O may be significantly
fragmented and this approach seems to improve peformance.

Before the patch, dm_set_device_limits restricted bio size to page size
if the underlying device had a merge function and the target didn't
provide a merge function.  After the patch, dm_set_device_limits
restricts bio size to page size if the underlying device has a merge
function, doesn't have DMF_MERGE_IS_OPTIONAL flag and the target doesn't
provide a merge function.

The snapshot target can't provide a merge function because when the merge
function is called, it is impossible to determine where the bio will be
remapped.  Previously this led us to impose a 4k limit, which we can
now remove if the snapshot store is located on a device without a merge
function.  Together with another patch for optimizing full chunk writes,
it improves performance from 29MB/s to 40MB/s when writing to the
filesystem on snapshot store.

If the snapshot store is placed on a non-dm device with a merge function
(such as md-raid), device mapper still limits all bios to page size.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:04 +01:00
Mike Snitzer 0864901254 dm table: clean dm_get_device and move exports
There is no need for __table_get_device to be factored out.
Also move the exports to the end of their respective functions.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:04 +01:00
Alasdair G Kergon 3e8dbb7f39 dm raid: tidy includes
A dm target only needs to use include/linux dm headers.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:03 +01:00
Alasdair G Kergon 2ca4c92f58 dm ioctl: prevent empty message
Detect invalid empty messages in core dm instead of requiring every target to
check this.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:03 +01:00
Jonathan Brassow 13c87583ea dm raid: cleanup parameter handling
Re-order the parameters so they are handled consistently in the same order
where defined, parsed and output.

Only include rebuild parameters in the STATUSTYPE_TABLE output if they were
supplied in the original table line.

Correct the parameter count when outputting rebuild: there are two words,
not one.

Use case-independent checks for keywords (as in other device-mapper targets).

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:03 +01:00
Jonathan Brassow a2d2b0345a dm snapshot: style cleanups
Coding style cleanups.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
2011-08-02 12:32:03 +01:00
Mikulas Patocka aa3f0794d2 dm snapshot: remove unused definitions
Remove a couple of unused #defines.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:03 +01:00
Mikulas Patocka 5bf45a3dcd dm kcopyd: remove nr_pages field from job structure
The nr_pages field in struct kcopyd_job is only used temporarily in
run_pages_job() to count the number of required pages.
We can use a local variable instead.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:02 +01:00
Mikulas Patocka 4622afb3f5 dm kcopyd: remove offset field from job structure
The offset field in struct kcopyd_job is always zero so remove it.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:02 +01:00
Joe Perches e29e65aacb dm: use vzalloc
Use vzalloc() instead of vmalloc()+memset().

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:02 +01:00
Kirill A. Shutemov 6c9b27ab08 dm log: userspace use list_move
Replace list_del() followed by list_add() with list_move().

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:02 +01:00
Akinobu Mita c8f543e078 dm log: clean up bit little endian bitops
Using __test_and_{set,clear}_bit_le() with ignoring its return value
can be replaced with __{set,clear}_bit_le().

This also removes unnecessary casts.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:01 +01:00
Mike Snitzer 936688d7eb dm table: fix discard support
Remove 'discards_supported' from the dm_table structure.  The same
information can be easily discovered from the table's target(s) in
dm_table_supports_discards().

Before this fix dm_table_supports_discards() would skip checking the
individual targets' 'discards_supported' flag if any one target in the
table didn't set num_discard_requests > 0.  Now the per-target
'discards_supported' flag is effective at insuring the final DM device
advertises discard support.  But, to be clear, targets that don't
support discards (!num_discard_requests) will not receive discard
requests.

Also DMWARN if a target sets 'discards_supported' override but forgets
to set 'num_discard_requests'.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:01 +01:00
Alasdair G Kergon 283a8328ca dm: suppress endian warnings
Suppress sparse warnings about cpu_to_le32() by using __le32 types for
on-disk data etc.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:01 +01:00
Alasdair G Kergon d15b774c29 dm: fix idr leak on module removal
Destroy _minor_idr when unloading the core dm module.  (Found by kmemleak.)

Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:01 +01:00
Mikulas Patocka bb91bc7bac dm io: flush cpu cache with vmapped io
For normal kernel pages, CPU cache is synchronized by the dma layer.
However, this is not done for pages allocated with vmalloc. If we do I/O
to/from vmallocated pages, we must synchronize CPU cache explicitly.

Prior to doing I/O on vmallocated page we must call
flush_kernel_vmap_range to flush dirty cache on the virtual address.
After finished read we must call invalidate_kernel_vmap_range to
invalidate cache on the virtual address, so that accesses to the virtual
address return newly read data and not stale data from CPU cache.

This patch fixes metadata corruption on dm-snapshots on PA-RISC and
possibly other architectures with caches indexed by virtual address.

Cc: stable <stable@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:01 +01:00
Mike Snitzer 286f367dad dm mpath: fix potential NULL pointer in feature arg processing
Avoid dereferencing a NULL pointer if the number of feature arguments
supplied is fewer than indicated.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org
2011-08-02 12:32:00 +01:00
Mikulas Patocka 762a80d9fc dm snapshot: flush disk cache when merging
This patch makes dm-snapshot flush disk cache when writing metadata for
merging snapshot.

Without cache flushing the disk may reorder metadata write and other
data writes and there is a possibility of data corruption in case of
power fault.

Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:00 +01:00
Linus Torvalds 6140333d36 Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md: (75 commits)
  md/raid10: handle further errors during fix_read_error better.
  md/raid10: Handle read errors during recovery better.
  md/raid10: simplify read error handling during recovery.
  md/raid10: record bad blocks due to write errors during resync/recovery.
  md/raid10:  attempt to fix read errors during resync/check
  md/raid10:  Handle write errors by updating badblock log.
  md/raid10: clear bad-block record when write succeeds.
  md/raid10: avoid writing to known bad blocks on known bad drives.
  md/raid10 record bad blocks as needed during recovery.
  md/raid10: avoid reading known bad blocks during resync/recovery.
  md/raid10 - avoid reading from known bad blocks - part 3
  md/raid10: avoid reading from known bad blocks - part 2
  md/raid10: avoid reading from known bad blocks - part 1
  md/raid10: Split handle_read_error out from raid10d.
  md/raid10: simplify/reindent some loops.
  md/raid5: Clear bad blocks on successful write.
  md/raid5.  Don't write to known bad block on doubtful devices.
  md/raid5: write errors should be recorded as bad blocks if possible.
  md/raid5: use bad-block log to improve handling of uncorrectable read errors.
  md/raid5: avoid reading from known bad blocks.
  ...
2011-07-28 05:50:27 -07:00
NeilBrown 58c54fcca3 md/raid10: handle further errors during fix_read_error better.
If we find more read/write errors we should record a bad block before
failing the device.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:25 +10:00
NeilBrown 5e5702898e md/raid10: Handle read errors during recovery better.
Currently when we get a read error during recovery, we simply abort
the recovery.

Instead, repeat the read in page-sized blocks.
On successful reads, write to the target.
On read errors, record a bad block on the destination,
and only if that fails do we abort the recovery.

As we now retry reads we need to know where we read from.  This was in
bi_sector but that can be changed during a read attempt.
So store the correct from_addr and to_addr in the r10_bio for later
access.


Signed-off-by: NeilBrown<neilb@suse.de>
2011-07-28 11:39:25 +10:00
NeilBrown e684e41db3 md/raid10: simplify read error handling during recovery.
If a read error is detected during recovery the code currently
fails the read device.
This isn't really necessary.  recovery_request_write will signal
a write error to end_sync_write and it will record a write
error on the destination device which will record a bad block
there or kick it from the array.

So just remove this call to do md_error.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:25 +10:00
NeilBrown 1a0b7cd826 md/raid10: record bad blocks due to write errors during resync/recovery.
If we get a write error during resync/recovery don't fail the device
but instead record a bad block.  If that fails we can then fail the
device.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:25 +10:00
NeilBrown f84ee364dd md/raid10: attempt to fix read errors during resync/check
We already attempt to fix read errors found during normal IO
and a 'repair' process.
It is best to try to repair them at any time they are found,
so move a test so that during sync and check a read error will
be corrected by over-writing with good data.

If both (all) devices have known bad blocks in the sync section we
won't try to fix even though the bad blocks might not overlap.  That
should be considered later.

Also if we hit a read error during recovery we don't try to fix it.
It would only be possible to fix if there were at least three copies
of data, which is not very common with RAID10.  But it should still
be considered later.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:25 +10:00
NeilBrown bd870a16c5 md/raid10: Handle write errors by updating badblock log.
When we get a write error (in the data area, not in metadata),
update the badblock log rather than failing the whole device.

As the write may well be many blocks, we trying writing each
block individually and only log the ones which fail.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:24 +10:00
NeilBrown 749c55e942 md/raid10: clear bad-block record when write succeeds.
If we succeed in writing to a block that was recorded as
being bad, we clear the bad-block record.

This requires some delayed handling as the bad-block-list update has
to happen in process-context.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:24 +10:00
NeilBrown d4432c23be md/raid10: avoid writing to known bad blocks on known bad drives.
Writing to known bad blocks on drives that have seen a write error
is asking for trouble.  So try to avoid these blocks.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:24 +10:00
NeilBrown e875ecea26 md/raid10 record bad blocks as needed during recovery.
When recovering one or more devices, if all the good devices have
bad blocks we should record a bad block on the device being rebuilt.

If this fails, we need to abort the recovery.

To ensure we don't think that we aborted later than we actually did,
we need to move the check for MD_RECOVERY_INTR earlier in md_do_sync,
in particular before mddev->curr_resync is updated.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:24 +10:00
NeilBrown 40c356ce5a md/raid10: avoid reading known bad blocks during resync/recovery.
During resync/recovery limit the size of the request to avoid
reading into a bad block that does not start at-or-before the current
read address.

Similarly if there is a bad block at this address, don't allow the
current request to extend beyond the end of that bad block.

Now that we don't ever read from known bad blocks, it is safe to allow
devices with those blocks into the array.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:24 +10:00
NeilBrown 8dbed5cebd md/raid10 - avoid reading from known bad blocks - part 3
When attempting to repair a read error, don't read from
devices with a known bad block.

As we are only reading PAGE_SIZE blocks, we don't try to
narrow down to smaller regions in the hope that only part of this
page is bad - it isn't worth the effort.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:24 +10:00
NeilBrown 7399c31bc9 md/raid10: avoid reading from known bad blocks - part 2
When redirecting a read error to a different device, we must
again avoid bad blocks and possibly split the request.

Spin_lock typo fixed thanks to Dan Carpenter <error27@gmail.com>

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:23 +10:00
NeilBrown 856e08e237 md/raid10: avoid reading from known bad blocks - part 1
This patch just covers the basic read path:
 1/ read_balance needs to check for badblocks, and return not only
    the chosen slot, but also how many good blocks are available
    there.
 2/ read submission must be ready to issue multiple reads to
    different devices as different bad blocks on different devices
    could mean that a single large read cannot be served by any one
    device, but can still be served by the array.
    This requires keeping count of the number of outstanding requests
    per bio.  This count is stored in 'bi_phys_segments'

On read error we currently just fail the request if another target
cannot handle the whole request.  Next patch refines that a bit.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:23 +10:00
NeilBrown 560f8e5532 md/raid10: Split handle_read_error out from raid10d.
raid10d() is too big and is about to get bigger, so split
handle_read_error() out as a separate function.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:23 +10:00
NeilBrown 1294b9c973 md/raid10: simplify/reindent some loops.
When a loop ends with a large if, it can be neater to change the
if to invert the condition and just 'continue'.
Then the body of the if can be indented to a lower level.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:23 +10:00
NeilBrown b84db560ea md/raid5: Clear bad blocks on successful write.
On a successful write to a known bad block, flag the sh
so that raid5d can remove the known bad block from the list.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:23 +10:00
NeilBrown 73e92e51b7 md/raid5. Don't write to known bad block on doubtful devices.
If a device has seen write errors, don't write to any known
bad blocks on that device.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:22 +10:00
NeilBrown bc2607f393 md/raid5: write errors should be recorded as bad blocks if possible.
When a write error is detected, don't mark the device as failed
immediately but rather record the fact for handle_stripe to deal with.

Handle_stripe then attempts to record a bad block.  Only if that fails
does the device get marked as faulty.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:22 +10:00
NeilBrown 7f0da59bdc md/raid5: use bad-block log to improve handling of uncorrectable read errors.
If we get an uncorrectable read error - record a bad block rather than
failing the device.
And if these errors (which may be due to known bad blocks) cause
recovery to be impossible, record a bad block on the recovering
devices, or abort the recovery.

As we might abort a recovery without failing a device we need to teach
RAID5 about recovery_disabled handling.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:22 +10:00
NeilBrown 31c176ecdf md/raid5: avoid reading from known bad blocks.
There are two times that we might read in raid5:
1/ when a read request fits within a chunk on a single
   working device.
   In this case, if there is any bad block in the range of
   the read, we simply fail the cache-bypass read and
   perform the read though the stripe cache.

2/ when reading into the stripe cache.  In this case we
   mark as failed any device which has a bad block in that
   strip (1 page wide).
   Note that we will both avoid reading and avoid writing.
   This is correct (as we will never read from the block, there
   is no point writing), but not optimal (as writing could 'fix'
   the error) - that will be addressed later.

If we have not seen any write errors on the device yet, we treat a bad
block like a recent read error.  This will encourage an attempt to fix
the read error which will either generate a write error, or will
ensure good data is stored there.  We don't yet forget the bad block
in that case.  That comes later.

Now that we honour bad blocks when reading we can allow devices with
bad blocks into the array.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:22 +10:00
NeilBrown 62096bce23 md/raid1: factor several functions out or raid1d()
raid1d is too big with several deep branches.
So separate them out into their own functions.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28 11:38:13 +10:00
NeilBrown 3a9f28a511 md/raid1: improve handling of read failure during recovery.
If we cannot read a block from anywhere during recovery, there is
now a better approach than just giving up.
We can record a bad block on each device and keep going - being
careful not to clear the bad block when a write succeeds as it might -
it will be a write of incorrect data.

We have now reached the state where - for raid1 - we only call
md_error if md_set_badblocks has failed.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28 11:33:42 +10:00
NeilBrown d8f05d2995 md/raid1: record badblocks found during resync etc.
If we find a bad block while writing as part of resync/recovery we
need to report that back to raid1d which must record the bad block,
or fail the device.

Similarly when fixing a read error, a further error should just
record a bad block if possible rather than failing the device.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28 11:33:00 +10:00
NeilBrown cd5ff9a16f md/raid1: Handle write errors by updating badblock log.
When we get a write error (in the data area, not in metadata),
update the badblock log rather than failing the whole device.

As the write may well be many blocks, we trying writing each
block individually and only log the ones which fail.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28 11:32:41 +10:00
NeilBrown 2ca68f5ed7 md/raid1: store behind-write pages in bi_vecs.
When performing write-behind we allocate pages to store the data
during write.
Previously we just keep a list of pages.  Now we keep a list of
bi_vec which includes offset and size.
This means that the r1bio has complete information to create a new
bio which will be needed for retrying after write errors.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28 11:32:10 +10:00
NeilBrown 4367af5561 md/raid1: clear bad-block record when write succeeds.
If we succeed in writing to a block that was recorded as
being bad, we clear the bad-block record.

This requires some delayed handling as the bad-block-list update has
to happen in process-context.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28 11:31:49 +10:00
NeilBrown 1f68f0c4b6 md/raid1: avoid writing to known-bad blocks on known-bad drives.
If we have seen any write error on a drive, then don't write to
any known-bad blocks on that drive.
If necessary, we divide the write request up into pieces just
like we do for reads, so each piece is either all written or
all not written to any given drive.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28 11:31:48 +10:00
NeilBrown de393cdea6 md: make it easier to wait for bad blocks to be acknowledged.
It is only safe to choose not to write to a bad block if that bad
block is safely recorded in metadata - i.e. if it has been
'acknowledged'.

If it hasn't we need to wait for the acknowledgement.

We support that using rdev->blocked wait and
md_wait_for_blocked_rdev by introducing a new device flag
'BlockedBadBlock'.

This flag is only advisory.
It is cleared whenever we acknowledge a bad block, so that a waiter
can re-check the particular bad blocks that it is interested it.

It should be set by a caller when they find they need to wait.
This (set after test) is inherently racy, but as
md_wait_for_blocked_rdev already has a timeout, losing the race will
have minimal impact.

When we clear "Blocked" was also clear "BlockedBadBlocks" incase it
was set incorrectly (see above race).

We also modify the way we manage 'Blocked' to fit better with the new
handling of 'BlockedBadBlocks' and to make it consistent between
externally managed and internally managed metadata.   This requires
that each raidXd loop checks if the metadata needs to be written and
triggers a write (md_check_recovery) if needed.  Otherwise a queued
write request might cause raidXd to wait for the metadata to write,
and only that thread can write it.

Before writing metadata, we set FaultRecorded for all devices that
are Faulty, then after writing the metadata we clear Blocked for any
device for which the Fault was certainly Recorded.

The 'faulty' device flag now appears in sysfs if the device is faulty
*or* it has unacknowledged bad blocks.  So user-space which does not
understand bad blocks can continue to function correctly.
User space which does, should not assume a device is faulty until it
sees the 'faulty' flag, and then sees the list of unacknowledged bad
blocks is empty.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:31:48 +10:00
NeilBrown d7a9d443bc md: add 'write_error' flag to component devices.
If a device has ever seen a write error, we will want to handle
known-bad-blocks differently.
So create an appropriate state flag and export it via sysfs.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28 11:31:48 +10:00
NeilBrown 06f603851f md/raid1: avoid reading known bad blocks during resync
When performing resync/etc, keep the size of the request
small enough that it doesn't overlap any known bad blocks.
Devices with badblocks at the start of the request are completely
excluded.
If there is nowhere to read from due to bad blocks, record
a bad block on each target device.

Now that we never read from known-bad-blocks we can allow devices with
known-bad-blocks into a RAID1.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:31:48 +10:00
NeilBrown d2eb35acfd md/raid1: avoid reading from known bad blocks.
Now that we have a bad block list, we should not read from those
blocks.
There are several main parts to this:
  1/ read_balance needs to check for bad blocks, and return not only
     the chosen device, but also how many good blocks are available
     there.
  2/ fix_read_error needs to avoid trying to read from bad blocks.
  3/ read submission must be ready to issue multiple reads to
     different devices as different bad blocks on different devices
     could mean that a single large read cannot be served by any one
     device, but can still be served by the array.
     This requires keeping count of the number of outstanding requests
     per bio.  This count is stored in 'bi_phys_segments'
  4/ retrying a read needs to also be ready to submit a smaller read
     and queue another request for the rest.

This does not yet handle bad blocks when reading to perform resync,
recovery, or check.

'md_trim_bio' will also be used for RAID10, so put it in md.c and
export it.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:31:48 +10:00
NeilBrown 9f2f383078 md: Disable bad blocks and v0.90 metadata.
v0.90 metadata cannot record bad blocks, so when loading metadata
for such a device, set shift to -1.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:31:47 +10:00
NeilBrown 2699b67223 md: load/store badblock list from v1.x metadata
Space must have been allocated when array was created.
A feature flag is set when the badblock list is non-empty, to
ensure old kernels don't load and trust the whole device.

We only update the on-disk badblocklist when it has changed.
If the badblocklist (or other metadata) is stored on a bad block, we
don't cope very well.

If metadata has no room for bad block, flag bad-blocks as disabled,
and do the same for 0.90 metadata.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:31:47 +10:00
NeilBrown 34b343cff4 md: don't allow arrays to contain devices with bad blocks.
As no personality understand bad block lists yet, we must
reject any device that is known to contain bad blocks.
As the personalities get taught, these tests can be removed.

This only applies to raid1/raid5/raid10.
For linear/raid0/multipath/faulty the whole concept of bad blocks
doesn't mean anything so there is no point adding the checks.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28 11:31:47 +10:00
NeilBrown 16c791a5af md/bad-block-log: add sysfs interface for accessing bad-block-log.
This can show the log (providing it fits in one page) and
allows bad blocks to be 'acknowledged' meaning that they
have safely been recorded in metadata.

Clearing bad blocks is not allowed via sysfs (except for
code testing).  A bad block can only be cleared when
a write to the block succeeds.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28 11:31:47 +10:00
NeilBrown 2230dfe4cc md: beginnings of bad block management.
This the first step in allowing md to track bad-blocks per-device so
that we can fail individual blocks rather than the whole device.

This patch just adds a data structure for recording bad blocks, with
routines to add, remove, search the list.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28 11:31:46 +10:00
NeilBrown a519b26dbe md: remove suspicious size_of()
When calling bioset_create we pass the size of the front_pad as
   sizeof(mddev)
which looks suspicious as mddev is a pointer and so it looks like a
common mistake where
   sizeof(*mddev)
was intended.
The size is actually correct as we want to store a pointer in the
front padding of the bios created by the bioset, so make the intent
more explicit by using
   sizeof(mddev_t *)

Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 07:56:24 +10:00
Jonathan Brassow 768e587e18 MD: generate an event when array sync is complete
This patch causes MD to generate an event (for device-mapper) when the
synchronization thread is reaped.  This is expected behavior for device-mapper.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27 11:00:37 +10:00
Jonathan Brassow 3520fa4db7 MD bitmap: Revert DM dirty log hooks
Revert most of commit e384e58549
  md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log.

MD should not need to use DM's dirty log - we decided to use md's
bitmaps instead.

Keeping the DIV_ROUND_UP clean-ups that were part of commit
e384e58549, however.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27 11:00:37 +10:00
Jonathan Brassow 654e8b5abc MD: raid1 s/sysfs_notify_dirent/sysfs_notify_dirent_safe
If device-mapper creates a RAID1 array that includes devices to
be rebuilt, it will deref a NULL pointer when finished because
sysfs is not used by device-mapper instantiated RAID devices.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27 11:00:36 +10:00
NeilBrown 8cfa7b0f67 md/raid5: Avoid BUG caused by multiple failures.
While preparing to write a stripe we keep the parity block or blocks
locked (R5_LOCKED) - towards the end of schedule_reconstruction.

If the array is discovered to have failed before this write completes
we can leave those blocks LOCKED, and init_stripe will notice that a
free stripe still has a locked block and will complain.

So clear the R5_LOCKED flag in handle_failed_stripe, and demote the
'BUG' to a 'WARN_ON'.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27 11:00:36 +10:00
Namhyung Kim cbea21703b md/raid10: move rdev->corrected_errors counting
Read errors are considered to corrected if write-back and re-read
cycle is finished without further problems. Thus moving the rdev->
corrected_errors counting after the re-reading looks more reasonable
IMHO.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27 11:00:36 +10:00
Namhyung Kim ddd5115fe5 md/raid5: move rdev->corrected_errors counting
Read errors are considered to corrected if write-back and re-read
cycle is finished without further problems. Thus moving the rdev->
corrected_errors counting after the re-reading looks more reasonable
IMHO.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27 11:00:36 +10:00
Namhyung Kim 9d3d80113d md/raid1: move rdev->corrected_errors counting
Read errors are considered to corrected if write-back and re-read
cycle is finished without further problems. Thus moving the rdev->
corrected_errors counting after the re-reading looks more reasonable
IMHO. Also included a couple of whitespace fixes on sync_page_io().

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27 11:00:36 +10:00
Namhyung Kim 65a06f0674 md: get rid of unnecessary casts on page_address()
page_address() returns void pointer, so the casts can be removed.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27 11:00:36 +10:00
NeilBrown 700c721389 md/raid10: Improve decision on whether to fail a device with a read error.
Normally we would fail a device with a READ error.  However if doing
so causes the array to fail, it is better to leave the device
in place and just return the read error to the caller.

The current test for decide if the array will fail is overly
simplistic.
We have a function 'enough' which can tell if the array is failed or
not, so use it to guide the decision.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27 11:00:36 +10:00
NeilBrown 2bb77736ae md/raid10: Make use of new recovery_disabled handling
When we get a read error during recovery, RAID10 previously
arranged for the recovering device to appear to fail so that
the recovery stops and doesn't restart.  This is misleading and wrong.

Instead, make use of the new recovery_disabled handling and mark
the target device and having recovery disabled.

Add appropriate checks in add_disk and remove_disk so that devices
are removed and not re-added when recovery is disabled.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27 11:00:36 +10:00
NeilBrown 5389042ffa md: change managed of recovery_disabled.
If we hit a read error while recovering a mirror, we want to abort the
recovery without necessarily failing the disk - as having a disk this
a read error is better than not having an array at all.

Currently this is managed with a per-array flag "recovery_disabled"
and is only implemented for RAID1.  For RAID10 we will need finer
grained control as we might want to disable recovery for individual
devices separately.

So push more of the decision making into the personality.
'recovery_disabled' is now a 'cookie' which is copied when the
personality want to disable recovery and is changed when a device is
added to the array as this is used as a trigger to 'try recovery
again'.

This will allow RAID10 to get the control that it needs.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27 11:00:36 +10:00
Namhyung Kim a478a069b6 md: remove ro check in md_check_recovery()
Commit c89a8eee61 ("Allow faulty devices to be removed from a
readonly array.") added some work on ro array in the function,
but it couldn't be done since we didn't allow the ro array to be
handled from the beginning. Fix it.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27 11:00:36 +10:00
Namhyung Kim 36fad858a7 md: introduce link/unlink_rdev() helpers
There are places where sysfs links to rdev are handled
in a same way. Add the helper functions to consolidate
them.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27 11:00:36 +10:00
Christian Dietrich 8bda470e8e md/raid: use printk_ratelimited instead of printk_ratelimit
As per printk_ratelimit comment, it should not be used.

Signed-off-by: Christian Dietrich <christian.dietrich@informatik.uni-erlangen.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27 11:00:36 +10:00
Akinobu Mita a0a02a7ad6 md: use proper little-endian bitops
Using __test_and_{set,clear}_bit_le() with ignoring its return value
can be replaced with __{set,clear}_bit_le().

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: NeilBrown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27 11:00:36 +10:00
NeilBrown acfe726bdd md/raid5: finalise new merged handle_stripe.
handle_stripe5() and handle_stripe6() are now virtually identical.
So discard one and rename the other to 'analyse_stripe()'.

It always returns 0, so change it to 'void' and remove the 'done'
variable in handle_stripe().

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-27 11:00:36 +10:00
NeilBrown 474af965fe md/raid5: move some more common code into handle_stripe
The RAID6 version of this code is usable for RAID5 providing:
  - we test "conf->max_degraded" rather than "2" as appropriate
  - we make sure s->failed_num[1] is meaningful (and not '-1')
    when s->failed > 1

The 'return 1' must become 'goto finish' in the new location.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-27 11:00:36 +10:00
NeilBrown 84789554e9 md/raid5: move more common code into handle_stripe
Apart from 'prexor' which can only be set for RAID5, and
'qd_idx' which can only be meaningful for RAID6, these two
chunks of code are nearly the same.

So combine them into one adding a test to call either
handle_parity_checks5 or handle_parity_checks6 as appropriate.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-27 11:00:36 +10:00
NeilBrown c8ac1803ff md/raid5: unite handle_stripe_dirtying5 and handle_stripe_dirtying6
RAID6 is only allowed to choose 'reconstruct-write' while RAID5 is
also allow 'read-modify-write'
Apart from this difference, handle_stripe_dirtying[56] are nearly
identical.  So resolve these differences and create just one function.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-27 11:00:36 +10:00
NeilBrown 93b3dbce64 md/raid5: unite fetch_block5 and fetch_block6
Provided that ->failed_num[1] is not a valid device number (which is
easily achieved) fetch_block6 provides all the functionality of
fetch_block5.

So remove the latter and rename the former to simply "fetch_block".

Then handle_stripe_fill5 and handle_stripe_fill6 become the same and
can similarly be united.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-27 11:00:36 +10:00
NeilBrown 5d35e09cae md/raid5: rearrange a test in fetch_block6.
Next patch will unite fetch_block5 and fetch_block6.
First I want to make the differences a little more clear.

For RAID6 if we are writing at all and there is a failed device, then
we need to load or compute every block so we can do a
reconstruct-write.
This case isn't needed for RAID5 - we will do a read-modify-write in
that case.
So make that test a separate test in fetch_block6 rather than merged
with two other tests.

Make a similar change in fetch_block5 so the one bit that is not
needed for RAID6 is clearly separate.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-27 11:00:36 +10:00
NeilBrown c5a3100062 md/raid5: move more code into common handle_stripe
The difference between the RAID5 and RAID6 code here is easily
resolved using conf->max_degraded.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-27 11:00:36 +10:00
NeilBrown 3687c06188 md/raid5: Move code for finishing a reconstruction into handle_stripe.
Prior to commit ab69ae12ce the code in handle_stripe5 and
handle_stripe6 to "Finish reconstruct operations initiated by the
expansion process" was identical.
That commit added an identical stanza of code to each function, but in
different places.  That was careless.

The raid5 code was correct, so move that out into handle_stripe and
remove raid6 version.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-27 11:00:36 +10:00
NeilBrown 86c374ba9f md/raid5: Remove stripe_head_state arg from handle_stripe_expansion.
This arg is only used to differentiate between RAID5 and RAID6 but
that is not needed.  For RAID5, raid5_compute_sector will set qd_idx
to "~0" so j with certainly not equals qd_idx, so there is no need
for a guard on that condition.

So remove the guard and remove the arg from the declaration and
callers of handle_stripe_expansion.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-27 11:00:36 +10:00
Arun Sharma 60063497a9 atomic: use <linux/atomic.h>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>

Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26 16:49:47 -07:00
NeilBrown cc94015a9e md/raid5: move stripe_head_state and more code into handle_stripe.
By defining the 'stripe_head_state' in 'handle_stripe', we can move
some common code out of handle_stripe[56]() and into handle_stripe.

The means that all accesses for stripe_head_state in handle_stripe[56]
need to be 's->' instead of 's.', but the compiler should inline
those functions and just use a direct stack reference, and future
patches while hoist most of this code up into handle_stripe()
so we will revert to "s.".

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-26 11:35:35 +10:00
NeilBrown c5709ef6a0 md/raid5: add some more fields to stripe_head_state
Adding these three fields will allow more common code to be moved
to handle_stripe()

struct field rearrangement by Namhyung Kim.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-26 11:35:20 +10:00
NeilBrown f2b3b44dee md/raid5: unify stripe_head_state and r6_state
'struct stripe_head_state' stores state about the 'current' stripe
that is passed around while handling the stripe.
For RAID6 there is an extension structure: r6_state, which is also
passed around.
There is no value in keeping these separate, so move the fields from
the latter into the former.

This means that all code now needs to treat s->failed_num as an small
array, but this is a small cost.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-26 11:35:19 +10:00
NeilBrown 82e5a1718b md/raid5: move common code into handle_stripe
There is common code at the start of handle_stripe5 and
handle_stripe6.  Move it into handle_stripe.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-26 11:35:15 +10:00
NeilBrown c4c1663be4 md/raid5: replace sh->lock with an 'active' flag.
sh->lock is now mainly used to ensure that two threads aren't running
in the locked part of handle_stripe[56] at the same time.

That can more neatly be achieved with an 'active' flag which we set
while running handle_stripe.  If we find the flag is set, we simply
requeue the stripe for later by setting STRIPE_HANDLE.

For safety we take ->device_lock while examining the state of the
stripe and creating a summary in 'stripe_head_state / r6_state'.
This possibly isn't needed but as shared fields like ->toread,
->towrite are checked it is safer for now at least.

We leave the label after the old 'unlock' called "unlock" because it
will disappear in a few patches, so renaming seems pointless.

This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE
later, but that is not a problem.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-26 11:34:20 +10:00
NeilBrown cbe47ec559 md/raid5: Protect some more code with ->device_lock.
Other places that change or follow dev->towrite and dev->written take
the device_lock as well as the sh->lock.
So it should really be held in these places too.
Also, doing so will allow sh->lock to be discarded.

with merged fixes by: Namhyung Kim <namhyung@gmail.com>


Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-26 11:20:35 +10:00
NeilBrown 83206d66b6 md/raid5: Remove use of sh->lock in sync_request
This is the start of a series of patches to remove sh->lock.

sync_request takes sh->lock before setting STRIPE_SYNCING to ensure
there is no race with testing it in handle_stripe[56].

Instead, use a new flag STRIPE_SYNC_REQUESTED and test it early
in handle_stripe[56] (after getting the same lock) and perform the
same set/clear operations if it was set.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-26 11:19:49 +10:00
Linus Torvalds bbd9d6f7fb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits)
  vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp
  isofs: Remove global fs lock
  jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory
  fix IN_DELETE_SELF on overwriting rename() on ramfs et.al.
  mm/truncate.c: fix build for CONFIG_BLOCK not enabled
  fs:update the NOTE of the file_operations structure
  Remove dead code in dget_parent()
  AFS: Fix silly characters in a comment
  switch d_add_ci() to d_splice_alias() in "found negative" case as well
  simplify gfs2_lookup()
  jfs_lookup(): don't bother with . or ..
  get rid of useless dget_parent() in btrfs rename() and link()
  get rid of useless dget_parent() in fs/btrfs/ioctl.c
  fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers
  drivers: fix up various ->llseek() implementations
  fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek
  Ext4: handle SEEK_HOLE/SEEK_DATA generically
  Btrfs: implement our own ->llseek
  fs: add SEEK_HOLE and SEEK_DATA flags
  reiserfs: make reiserfs default to barrier=flush
  ...

Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new
shrinker callout for the inode cache, that clashed with the xfs code to
start the periodic workers later.
2011-07-22 19:02:39 -07:00
Kay Sievers f15146380d fs: seq_file - add event counter to simplify poll() support
Moving the event counter into the dynamically allocated 'struc seq_file'
allows poll() support without the need to allocate its own tracking
structure.

All current users are switched over to use the new counter.

Requested-by: Andrew Morton akpm@linux-foundation.org
Acked-by: NeilBrown <neilb@suse.de>
Tested-by: Lucas De Marchi lucas.demarchi@profusion.mobi
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 20:47:50 -04:00
Lai Jiangshan b119cbab3a md,rcu: Convert call_rcu(free_conf) to kfree_rcu()
The rcu callback free_conf() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(free_conf).

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: NeilBrown <neilb@suse.de>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-20 11:05:29 -07:00
Namhyung Kim ffd96e35c1 md/raid5: get rid of duplicated call to bio_data_dir()
In raid5::make_request(), once bio_data_dir(@bi) is detected
it never (and couldn't) be changed. Use the result always.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-18 17:38:51 +10:00
Namhyung Kim 6ce328462c md/raid5: use kmem_cache_zalloc()
Replace kmem_cache_alloc + memset(,0,) to kmem_cache_zalloc.
I think it's not harmful since @conf->slab_cache already knows
actual size of struct stripe_head.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-18 17:38:50 +10:00