Commit Graph

119 Commits

Author SHA1 Message Date
Mike Snitzer 5dea271b6d dm table: pass correct dev area size to device_area_is_valid
Incorrect device area lengths are being passed to device_area_is_valid().

The regression appeared in 2.6.31-rc1 through commit
754c5fc7eb.

With the dm-stripe target, the size of the target (ti->len) was used
instead of the stripe_width (ti->len/#stripes).  An example of a
consequent incorrect error message is:

  device-mapper: table: 254:0: sdb too small for target

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-07-23 20:30:42 +01:00
Mikulas Patocka 69885683d2 dm raid1: wake kmirrord when requeueing delayed bios after remote recovery
The recent commit 7513c2a761 (dm raid1:
add is_remote_recovering hook for clusters) changed do_writes() to
update the ms->writes list but forgot to wake up kmirrord to process it.

The rule is that when anything is being added on ms->reads, ms->writes
or ms->failures and the list was empty before we must call
wakeup_mirrord (for immediate processing) or delayed_wake (for delayed
processing).  Otherwise the bios could sit on the list indefinitely.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
CC: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-07-23 20:30:37 +01:00
Mike Snitzer af4874e03e dm target:s introduce iterate devices fn
Add .iterate_devices to 'struct target_type' to allow a function to be
called for all devices in a DM target.  Implemented it for all targets
except those in dm-snap.c (origin and snapshot).

(The raid1 version number jumps to 1.12 because we originally reserved
1.1 to 1.11 for 'block_on_error' but ended up using 'handle_errors'
instead.)

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: martin.petersen@oracle.com
2009-06-22 10:12:33 +01:00
Christoph Hellwig 8f3d8ba20e block: move bio list helpers into bio.h
It's used by DM and MD and generally useful, so move the bio list
helpers into bio.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 08:28:09 +02:00
Jonathan Brassow 7513c2a761 dm raid1: add is_remote_recovering hook for clusters
The logging API needs an extra function to make cluster mirroring
possible.  This new function allows us to check whether a mirror
region is being recovered on another machine in the cluster.  This
helps us prevent simultaneous recovery I/O and process I/O to the
same locations on disk.

Cluster-aware log modules will implement this function.  Single
machine log modules will not.  So, there is no performance
penalty for single machine mirrors.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:30 +01:00
Mikulas Patocka 95f8fac8dc dm raid1: switch read_record from kmalloc to slab to save memory
With my previous patch to save bi_io_vec, the size of dm_raid1_read_record
is significantly increased (the vector list takes 3072 bytes on 32-bit machines
and 4096 bytes on 64-bit machines).

The structure dm_raid1_read_record used to be allocated with kmalloc,
but kmalloc aligns the size on the next power-of-two so an object
slightly greater than 4096 will allocate 8192 bytes of memory and half of
that memory will be wasted.

This patch turns kmalloc into a slab cache which doesn't have this
padding so it will reduce the memory consumed.

Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-04-02 19:55:24 +01:00
Milan Broz 2045e88edb dm log: move region_size validation
Move log size validation from mirror target to log constructor.

Removed PAGE_SIZE restriction we no longer think necessary.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06 03:05:01 +00:00
Mikulas Patocka 10d3bd09a3 dm: consolidate target deregistration error handling
Change dm_unregister_target to return void and use BUG() for error
reporting.

dm_unregister_target can only fail because of programming bug in the
target driver. It can't fail because of user's behavior or disk errors.

This patch changes unregister_target to return void and use BUG if
someone tries to unregister non-registered target or unregister target
that is in use.

This patch removes code duplication (testing of error codes in all dm
targets) and reports bugs in just one place, in dm_unregister_target. In
some target drivers, these return codes were ignored, which could lead
to a situation where bugs could be missed.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06 03:04:58 +00:00
Jonathan Brassow d460c65a6a dm raid1: fix error count
Always increase the error count when I/O on a leg of a mirror fails.

The error count is used to decide whether to select an alternative
mirror leg.  If the target doesn't use the "handle_errors" feature, the
error count is not updated and the bio can get requeued forever by the
read callback.

Fix it by increasing error_count before the handle_errors feature
checking.

Cc: stable@kernel.org
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06 03:04:57 +00:00
Mikulas Patocka 18776c7316 dm raid1: flush workqueue before destruction
We queue work on keventd queue --- so this queue must be flushed in the
destructor. Otherwise, keventd could access mirror_set after it was freed.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org
2008-11-13 23:38:52 +00:00
Ilpo Jarvinen b34578a484 dm raid1: fix do_failures
Missing braces.  Commit 1f965b1943 (dm raid1: separate region_hash interface
part1) broke it.

Signed-off-by: Ilpo Jarvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Heinz Mauelshagen <hjm@redhat.com>
2008-10-30 13:33:07 +00:00
Heinz Mauelshagen 1f965b1943 dm raid1: separate region_hash interface part1
Separate the region hash code from raid1 so it can be shared by forthcoming
targets.  Use BUG_ON() for failed async dm_io() calls.

Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-21 17:45:06 +01:00
Mikulas Patocka 586e80e6ee dm: remove dm header from targets
Change #include "dm.h" to #include <linux/device-mapper.h> in all targets.
Targets should not need direct access to internal DM structures.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-21 17:44:59 +01:00
Mikulas Patocka d63a5ce3c0 dm: publish array_too_big
Move array_too_big to include/linux/device-mapper.h because it is
used by targets.

Remove the test from dm-raid1 as the number of mirror legs is limited
such that it can never fail.  (Even for stripes it seems rather
unlikely.)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-21 17:44:57 +01:00
Jonathan Brassow f7c83e2e47 dm raid1: kcopyd should stop on error if errors handled
dm-raid1 is setting the 'DM_KCOPYD_IGNORE_ERROR' flag unconditionally
when assigning kcopyd work.  kcopyd is responsible for copying an
assigned section of disk to one or more other disks.  The
'DM_KCOPYD_IGNORE_ERROR' flag affects kcopyd in the following way:

When not set:
kcopyd will immediately stop the copy operation when an error is
encountered.

When set:
kcopyd will try to proceed regardless of errors and try to continue
copying any remaining amount.

Since dm-raid1 tracks regions of the address space that are (or
are not) in sync and it now has the ability to handle these
errors, we can safely enable this optimization.  This optimization
is conditional on whether mirror error handling has been enabled.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-10-10 13:36:59 +01:00
Mikulas Patocka 7ff14a3615 dm: unplug queues in threads
Remove an avoidable 3ms delay on some dm-raid1 and kcopyd I/O.

It is specified that any submitted bio without BIO_RW_SYNC flag may plug the
queue (i.e. block the requests from being dispatched to the physical device).

The queue is unplugged when the caller calls blk_unplug() function. Usually, the
sequence is that someone calls submit_bh to submit IO on a buffer. The IO plugs
the queue and waits (to be possibly joined with other adjacent bios). Then, when
the caller calls wait_on_buffer(), it unplugs the queue and submits the IOs to
the disk.

This was happenning:

When doing O_SYNC writes, function fsync_buffers_list() submits a list of
bios to dm_raid1, the bios are added to dm_raid1 write queue and kmirrord is
woken up.

fsync_buffers_list() calls wait_on_buffer().  That unplugs the queue, but
there are no bios on the device queue as they are still in the dm_raid1 queue.

wait_on_buffer() starts waiting until the IO is finished.

kmirrord is scheduled, kmirrord takes bios and submits them to the devices.

The submitted bio plugs the harddisk queue but there is no one to unplug it.
(The process that called wait_on_buffer() is already sleeping.)

So there is a 3ms timeout, after which the queues on the harddisks are
unplugged and requests are processed.

This 3ms timeout meant that in certain workloads (e.g. O_SYNC, 8kb writes),
dm-raid1 is 10 times slower than md raid1.

Every time we submit something asynchronously via dm_io, we must unplug the
queue actually to send the request to the device.

This patch adds an unplug call to kmirrord - while processing requests, it keeps
the queue plugged (so that adjacent bios can be merged); when it finishes
processing all the bios, it unplugs the queue to submit the bios.

It also fixes kcopyd which has the same potential problem. All kcopyd requests
are submitted with BIO_RW_SYNC.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-25 13:26:57 +01:00
Mikulas Patocka a2aebe03be dm raid1: use timer
This patch replaces the schedule() in the main kmirrord thread with a timer.
The schedule() could introduce an unwanted delay when work is ready to be
processed.

The code instead calls wake() when there's work to be done immediately, and
delayed_wake() after a failure to give a short delay before retrying.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:56 +01:00
Alasdair G Kergon a765e20eeb dm: move include files
Publish the dm-io, dm-log and dm-kcopyd headers in include/linux.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:55 +01:00
Heinz Mauelshagen 416cd17b19 dm log: clean interface
Clean up the dm-log interface to prepare for publishing it in include/linux.

Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:46 +01:00
Heinz Mauelshagen eb69aca5d3 dm kcopyd: clean interface
Clean up the kcopyd interface to prepare for publishing it in include/linux.

Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:44 +01:00
Heinz Mauelshagen 22a1ceb1e6 dm io: clean interface
Clean up the dm-io interface to prepare for publishing it in include/linux.

Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:43 +01:00
Heinz Mauelshagen 769aef30f0 dm log: move dirty region log code into separate module
Move the dirty region log code into a separate module so
other targets can share the code.

Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:39 +01:00
Robert P. J. Day c12bfc923e dm raid1: use list_split_init
Use shorter list_splice_init() for brevity.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-04-25 13:26:36 +01:00
Alasdair G Kergon 4cdc1d1fa5 dm io: write error bits form long not int
write_err is an unsigned long used with set_bit() so should not be passed
around as unsigned int.

http://bugzilla.kernel.org/show_bug.cgi?id=10271

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-28 14:45:23 -07:00
Adrian Bunk e03f1a8422 dm-raid1.c: fix NULL dereferences
This patch fixes two NULL dereferences introduced by commit
06386bbfd2 and spotted by the Coverity
checker.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-19 15:52:27 -08:00
Al Viro 39ed7adb17 dm-raid1 breakage on 64bit
test_and_set_bit() on address of uint32_t is a Bad Idea(tm)...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-13 08:16:34 -08:00
Jonathan Brassow af195ac82e dm raid1: report fault status
This patch adds extra information to the mirror status output, so that
it can be determined which device(s) have failed.  For each mirror device,
a character is printed indicating the most severe error encountered.  The
characters are:
 *    A => Alive - No failures
 *    D => Dead - A write failure occurred leaving mirror out-of-sync
 *    S => Sync - A sychronization failure occurred, mirror out-of-sync
 *    R => Read - A read failure occurred, mirror data unaffected
This allows userspace to properly reconfigure the mirror set.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-02-08 02:11:39 +00:00
Jonathan Brassow 06386bbfd2 dm raid1: handle read failures
This patch gives the ability to respond-to/record device failures
that happen during read operations.  It also adds the ability to
read from mirror devices that are not the primary if they are
in-sync.

There are essentially two read paths in mirroring; the direct path
and the queued path.  When a read request is mapped, if the region
is 'in-sync' the direct path is taken; otherwise the queued path
is taken.

If the direct path is taken, we must record bio information so that
if the read fails we can retry it.  We then discover the status of
a direct read through mirror_end_io.  If the read has failed, we will
mark the device from which the read was attempted as failed (so we
don't try to read from it again), restore the bio and try again.

If the queued path is taken, we discover the results of the read
from 'read_callback'.  If the device failed, we will mark the device
as failed and attempt the read again if there is another device
where this region is known to be 'in-sync'.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-02-08 02:11:37 +00:00
Jonathan Brassow b80aa7a0c2 dm raid1: fix EIO after log failure
This patch adds the ability to requeue write I/O to
core device-mapper when there is a log device failure.

If a write to the log produces and error, the pending writes are
put on the "failures" list.  Since the log is marked as failed,
they will stay on the failures list until a suspend happens.

Suspends come in two phases, presuspend and postsuspend.  We must
make sure that all the writes on the failures list are requeued
in the presuspend phase (a requirement of dm core).  This means
that recovery must be complete (because writes may be delayed
behind it) and the failures list must be requeued before we
return from presuspend.

The mechanisms to ensure recovery is complete (or stopped) was
already in place, but needed to be moved from postsuspend to
presuspend.  We rely on 'flush_workqueue' to ensure that the
mirror thread is complete and therefore, has requeued all writes
in the failures list.

Because we are using flush_workqueue, we must ensure that no
additional 'queue_work' calls will produce additional I/O
that we need to requeue (because once we return from
presuspend, we are unable to do anything about it).  'queue_work'
is called in response to the following functions:
- complete_resync_work = NA, recovery is stopped
- rh_dec (mirror_end_io) = NA, only calls 'queue_work' if it
                           is ready to recover the region
                           (recovery is stopped) or it needs
                           to clear the region in the log*
                           **this doesn't get called while
                           suspending**
- rh_recovery_end = NA, recovery is stopped
- rh_recovery_start = NA, recovery is stopped
- write_callback = 1) Writes w/o failures simply call
                   bio_endio -> mirror_end_io -> rh_dec
                   (see rh_dec above)
                   2) Writes with failures are put on
                   the failures list and queue_work is
                   called**
                   ** write_callbacks don't happen
                   during suspend **
- do_failures = NA, 'queue_work' not called if suspending
- add_mirror (initialization) = NA, only done on mirror creation
- queue_bio = NA, 1) delayed I/O scheduled before flush_workqueue
              is called.  2) No more I/Os are being issued.
              3) Re-attempted READs can still be handled.
              (Write completions are handled through rh_dec/
              write_callback - mention above - and do not
              use queue_bio.)

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-02-08 02:11:35 +00:00
Jonathan Brassow 8f0205b798 dm raid1: handle recovery failures
This patch adds the calls to 'fail_mirror' if an error occurs during
mirror recovery (aka resynchronization).  'fail_mirror' is responsible
for recording the type of error by mirror device and ensuring an event
gets raised for the purpose of notifying userspace.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-02-08 02:11:32 +00:00
Jonathan Brassow 72f4b31410 dm raid1: handle write failures
This patch gives mirror the ability to handle device failures
during normal write operations.

The 'write_callback' function is called when a write completes.
If all the writes failed or succeeded, we report failure or
success respectively.  If some of the writes failed, we call
fail_mirror; which increments the error count for the device, notes
the type of error encountered (DM_RAID1_WRITE_ERROR),  and
selects a new primary (if necessary).  Note that the primary
device can never change while the mirror is not in-sync (IOW,
while recovery is happening.)  This means that the scenario
where a failed write changes the primary and gives
recovery_complete a chance to misread the primary never happens.
The fact that the primary can change has necessitated the change
to the default_mirror field.  We need to protect against reading
garbage while the primary changes.  We then add the bio to a new
list in the mirror set, 'failures'.  For every bio in the 'failures'
list, we call a new function, '__bio_mark_nosync', where we mark
the region 'not-in-sync' in the log and properly set the region
state as, RH_NOSYNC.  Userspace must also be notified of the
failure.  This is done by 'raising an event' (dm_table_event()).
If fail_mirror is called in process context the event can be raised
right away.  If in interrupt context, the event is deferred to the
kmirrord thread - which raises the event if 'event_waiting' is set.

Backwards compatibility is maintained by ignoring errors if
the DM_FEATURES_HANDLE_ERRORS flag is not present.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2008-02-08 02:11:29 +00:00
Jonathan Brassow aa5617c553 dm raid1: add mirror_set to struct mirror
Store a pointer to the owning mirror_set structure within each mirror
structure for a subsequent patch to use.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:22 +01:00
Jonathan Brassow 6b3df0d7a5 dm log: split suspend
There are now two phases to a suspend in device-mapper -
presuspend and postsuspend.  This patch removes the
single 'suspend' in the logging API and replaces it with
'presuspend' and 'postsuspend' functions to align it
better with core device-mapper.

A subsequent patch will make use of 'presuspend'.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:21 +01:00
vignesh babu 6f3c3f0afa dm: use is_power_of_2
Replacing n & (n - 1) for power of 2 check by is_power_of_2(n)

Signed-off-by: vignesh babu <vignesh.babu@wipro.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:06 +01:00
Dmitry Monakhov a72cf737e0 dm raid1: fix leakage
Add missing 'dm_io_client_destroy' to alloc_context error path.
Reorganize mirror constructor error path in order to prevent
workqueue leakage.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2007-10-20 02:01:01 +01:00
NeilBrown 6712ecf8f6 Drop 'size' argument from bio_endio and bi_end_io
As bi_end_io is only called once when the reqeust is complete,
the 'size' argument is now redundant.  Remove it.

Now there is no need for bio_endio to subtract the size completed
from bi_size.  So don't do that either.

While we are at it, change bi_end_io to return void.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:57 +02:00
Yoann Padioleau dd00cc486a some kmalloc/memset ->kzalloc (tree wide)
Transform some calls to kmalloc/memset to a single kzalloc (or kcalloc).

Here is a short excerpt of the semantic patch performing
this transformation:

@@
type T2;
expression x;
identifier f,fld;
expression E;
expression E1,E2;
expression e1,e2,e3,y;
statement S;
@@

 x =
- kmalloc
+ kzalloc
  (E1,E2)
  ...  when != \(x->fld=E;\|y=f(...,x,...);\|f(...,x,...);\|x=E;\|while(...) S\|for(e1;e2;e3) S\)
- memset((T2)x,0,E1);

@@
expression E1,E2,E3;
@@

- kzalloc(E1 * E2,E3)
+ kcalloc(E1,E2,E3)

[akpm@linux-foundation.org: get kcalloc args the right way around]
Signed-off-by: Yoann Padioleau <padator@wanadoo.fr>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Acked-by: Russell King <rmk@arm.linux.org.uk>
Cc: Bryan Wu <bryan.wu@analog.com>
Acked-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Dave Airlie <airlied@linux.ie>
Acked-by: Roland Dreier <rolandd@cisco.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Acked-by: Dmitry Torokhov <dtor@mail.ru>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org>
Acked-by: Pierre Ossman <drzeus-list@drzeus.cx>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Greg KH <greg@kroah.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:50 -07:00
Jonathan Brassow fc1ff9588a dm raid1: handle log failure
When writing to a mirror, the log must be updated first.  Failure
to update the log could result in the log not properly reflecting
the state of the mirror if the machine should crash.

We change the return type of the rh_flush function to give us
the ability to check if a log write was successful.  If the
log write was unsuccessful, we fail the writes to avoid the
case where the log does not properly reflect the state of the
mirror.

A follow-up patch - which is dependent on the ability to
requeue I/O's to core device-mapper - will requeue the I/O's
for retry (allowing the mirror to be reconfigured.)

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-12 15:01:08 -07:00
Jonathan Brassow f44db678ed dm raid1: handle resync failures
Device-mapper mirroring currently takes a best effort approach to
recovery - failures during mirror synchronization are completely ignored.
This means that regions are marked 'in-sync' and 'clean' and removed
from the hash list.  Future reads and writes that query the region
will incorrectly interpret the region as in-sync.

This patch handles failures during the recovery process.  If a failure
occurs, the region is marked as 'not-in-sync' (aka RH_NOSYNC) and added
to a new list 'failed_recovered_regions'.

Regions on the 'failed_recovered_regions' list are not marked as 'clean'
upon removal from the list.  Furthermore, if the DM_RAID1_HANDLE_ERRORS
flag is set, the region is marked as 'not-in-sync'.  This action prevents
any future read-balancing from choosing an invalid device because of the
'not-in-sync' status.

If "handle_errors" is not specified when creating a mirror (leaving the
DM_RAID1_HANDLE_ERRORS flag unset), failures will be ignored exactly as they
would be without this patch.  This is to preserve backwards compatibility with
user-space tools, such as 'pvmove'.  However, since future read-balancing
policies will rely on the correct sync status of a region, a user must choose
"handle_errors" when using read-balancing.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-12 15:01:08 -07:00
Jonathan Brassow 943317efdb dm raid1: clear region outside spinlock
A clear_region function is permitted to block (in practice, rare) but gets
called in rh_update_states() with a spinlock held.

The bits being marked and cleared by the above functions are used
to update the on-disk log, but are never read directly.  We can
perform these operations outside the spinlock since the
bits are only changed within one thread viz.
   - mark_region in rh_inc()
   - clear_region in rh_update_states().

So, we grab the clean_regions list items via list_splice() within the
spinlock and defer clear_region() until we iterate over the list for
deletion - similar to how the recovered_regions list is already handled.
We then move the flush() call down to ensure it encapsulates the changes
which are done by the later calls to clear_region().

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-12 15:01:08 -07:00
Milan Broz c95bc206da dm raid1: fix status
Fix mirror status line broken in dm-log-report-fault-status.patch:
  - space missing between two words
  - placeholder ("0") required for compatibility with a subsequent patch
  - incorrect offset parameter

Cc: stable@kernel.org
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-12 15:01:08 -07:00
Alasdair G Kergon 0cd3312434 dm: remove duplicate module name from error msgs
Remove explicit module name from messages as the macro now includes it
automatically.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-12 15:01:08 -07:00
Jonathan Brassow b997b82d26 dm raid1: switch rh_in_sync to blocking in do_reads
The call to rh_in_sync() in do_reads() should be allowed to block.  It is in
the mirror worker thread which already permits blocking operations.  This will
be needed to support clustered mirroring which will perform network
operations.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:48 -07:00
Jonathan Brassow f5353cd7c9 dm raid1: fix to commit pending clear region requests
With the code as it is, it is possible for oustanding clear region requests
never to get flushed when a mirror is deactivated or suspended.  This means
there will always be some resync work required when a mirror is activated,
even though it may very well be in-sync.

Always requesting the flush doesn't hurt us.  This is because the log tracks
whether any changes occurred and, if not, no flush is performed.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:48 -07:00
Milan Broz 88be163abb dm raid1: update dm io interface
This patch ports dm-raid1.c to the new dm-io interface.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:47 -07:00
Jonathan E Brassow a8e6afa236 dm raid1: add handle_errors feature flag
This patch adds the ability to specify desired features in the mirror
constructor/mapping table.

The first feature of interest is "handle_errors".  Currently, mirroring will
ignore any I/O errors from the devices.  Subsequent patches will check for
this flag and handle the errors.  If flag/feature is not present, mirror will
do nothing - maintaining backwards compatibility.

Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:47 -07:00
Jonathan E Brassow 315dcc226f dm log: report fault status
This patch reports the status of the log device so that userspace can detect
the error and take appropriate action.

Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:47 -07:00
Holger Smolinski 6ad36fe2b4 dm raid1: one kmirrord per mirror
This patch replaces the single instance of kmirrord by one instance per mirror
set.  This change is required to avoid a deadlock in kmirrord when the
persistent dirty log of a mirror itself resides on a mirror.  The single
instance of kmirrord then issues a sync write to the dirty log in write_bits
which gets deferred to kmirrord itself later in the call chain.  But kmirrord
never does the deferred work because it is still waiting for the sync
write_bits.

_mirror_sets is removed as it no longer needed, and we always flush the
workqueue before destroying it to ensure all work is complete before
destroying it.

Signed-off-by: Holger Smolinski <smolinski@de.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:46 -07:00
Jonathan E Brassow f3ee6b2f62 [PATCH] dm: log: rename complete_resync_work
The complete_resync_work function only provides the ability to change an
out-of-sync region to in-sync.  This patch enhances the function to allow us
to change the status from in-sync to out-of-sync as well, something that is
needed when a mirror write to one of the devices or an initial resync on a
given region fails.

Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:09 -08:00
Kiyoshi Ueda d2a7ad29a8 [PATCH] dm: map and endio symbolic return codes
Update existing targets to use the new symbols for return values from target
map and end_io functions.

There is no effect on behaviour.

Test results:
Done build test without errors.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:09 -08:00
David Howells c4028958b6 WorkStruct: make allyesconfig
Fix up for make allyesconfig.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-11-22 14:57:56 +00:00
Jonathan E Brassow 33184048dc [PATCH] dm: raid1: fix waiting for io on suspend
All device-mapper targets must complete outstanding I/O before suspending.
The mirror target generates I/O in its recovery phase and fails to wait for
it.  It needs to be tracked so we can ensure that it has completed before we
suspend.

[akpm@osdl.org: cleanup]
Signed-off-by: Jonathan E Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: <dm-devel@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-08 18:29:23 -08:00
Jonathan Brassow e52b8f6dbe [PATCH] dm mirror: remove trailing space from table
Remove trailing space from 'dmsetup table' output.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:15 -07:00
Daniel Kobras c06aad854f [PATCH] dm: Fix deadlock under high i/o load in raid1 setup.
On an nForce4-equipped machine with two SATA disk in raid1 setup using dmraid,
we experienced frequent deadlock of the system under high i/o load.  'cat
/dev/zero > ~/zero' was the most reliable way to reproduce them: Randomly
after a few GB, 'cp' would be left in 'D' state along with kjournald and
kmirrord.  The functions cp and kjournald were blocked in did vary, but
kmirrord's wchan always pointed to 'mempool_alloc()'.  We've seen this pattern
on 2.6.15 and 2.6.17 kernels.  http://lkml.org/lkml/2005/4/20/142 indicates
that this problem has been around even before.

So much for the facts, here's my interpretation: mempool_alloc() first tries
to atomically allocate the requested memory, or falls back to hand out
preallocated chunks from the mempool.  If both fail, it puts the calling
process (kmirrord in this case) on a private waitqueue until somebody refills
the pool.  Where the only 'somebody' is kmirrord itself, so we have a
deadlock.

I worked around this problem by falling back to a (blocking) kmalloc when
before kmirrord would have ended up on the waitqueue.  This defeats part of
the benefits of using the mempool, but at least keeps the system running.  And
it could be done with a two-line change.  Note that mempool_alloc() clears the
GFP_NOIO flag internally, and only uses it to decide whether to wait or return
an error if immediate allocation fails, so the attached patch doesn't change
behaviour in the non-deadlocking case.  Path is against current git
(2.6.18-rc4), but should apply to earlier versions as well.  I've tested on
2.6.15, where this patch makes the difference between random lockup and a
stable system.

Signed-off-by: Daniel Kobras <kobras@linux.de>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-27 11:01:28 -07:00
Alasdair G Kergon 72d9486169 [PATCH] dm: improve error message consistency
Tidy device-mapper error messages to include context information
automatically.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:36 -07:00
Jonathan Brassow ce503f59ae [PATCH] dm kcopyd: error accumulation fix
kcopyd should accumulate errors - otherwise I/O failures may be ignored
unintentionally.

And invert 'success' (used in a future patch), using a more intuitive
!(read_err || write_err).

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:35 -07:00
Kevin Corry 702ca6f0be [PATCH] dm mirror log: sector size fix
On-disk logs for dm-mirror devices are currently hard-coded to use 512 byte
hard-sector-sizes.  This patch fixes dm-log so it will work with devices with
non-512-byte hard-sector-sizes.

To maintain full compatibility, instead of moving the clean-bits bitset to a
bitset, and enlarges the disk-header buffer to encompass both the header and
the bitset.  The I/O routines for the bitset are removed, and the I/O routines
for the disk-header now also read/write the bitset.

Signed-off-by: Kevin Corry <kevcorry@us.ibm.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:35 -07:00
Neil Brown e4c8b3ba34 [PATCH] dm: mirror sector offset fix
The device-mapper core does not perform any remapping of bios before passing
them to the targets.  If a particular mapping begins part-way into a device,
targets obtain the sector relative to the start of the mapping by subtracting
ti->begin.

The dm-raid1 target didn't do this everywhere: this patch fixes it, taking
care to subtract ti->begin exactly once for each bio.

[akpm: too late for 2.6.17 - suitable for 2.6.17.x after it has settled]

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:35 -07:00
Akinobu Mita 179e09172a [PATCH] drivers: use list_move()
This patch converts the combination of list_del(A) and list_add(A, B) to
list_move(A, B) under drivers/.

Acked-by: Corey Minyard <minyard@mvista.com>
Cc: Ben Collins <bcollins@debian.org>
Acked-by: Roland Dreier <rolandd@cisco.com>
Cc: Alasdair Kergon <dm-devel@redhat.com>
Cc: Gerd Knorr <kraxel@bytesex.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frank Pavlic <fpavlic@de.ibm.com>
Acked-by: Matthew Wilcox <matthew@wil.cx>
Cc: Andrew Vasquez <linux-driver@qlogic.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:18 -07:00
Andrew Morton 4ee218cd67 [PATCH] dm: remove SECTOR_FORMAT
We don't know what type sector_t has.  Sometimes it's unsigned long, sometimes
it's unsigned long long.  For example on ppc64 it's unsigned long with
CONFIG_LBD=n and on x86_64 it's unsigned long long with CONFIG_LBD=n.

The way to handle all of this is to always use unsigned long long and to
always typecast the sector_t when printing it.

Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:58 -08:00
Jun'ichi Nomura 930d332a23 [PATCH] drivers/md/dm-raid1.c: Fix inconsistent mirroring after interrupted recovery
dm-mirror has potential data corruption problem: while on-disk log shows
that all disk contents are in-sync, actual contents of the disks are not
synchronized.  This problem occurs if initial recovery (synching) is
interrupted and resumed.

Attached patch fixes this problem.

Background:

rh_dec() changes the region state from RH_NOSYNC (out-of-sync) to RH_CLEAN
(in-sync), which results in the corresponding bit of clean_bits being set.

This is harmful if on-disk log is used and the map is removed/suspended
before the initial sync is completed.  The clean_bits is written down to
the on-disk log at the map removal, and, upon resume, it's read and copied
to sync_bits.  Since the recovery process refers to the sync_bits to find a
region to be recovered, the region whose state was changed from RH_NOSYNC
to RH_CLEAN is no longer recovered.

If you haven't applied dm-raid1-read-balancing.patch proposed in dm-devel
sometimes ago, the contents of the mirrored disk just corrupt silently.  If
you have, balanced read may get bogus data from out-of-sync disks.

The patch keeps RH_NOSYNC state unchanged.  It will be changed to
RH_RECOVERING when recovery starts and get reclaimed when the recovery
completes.  So it doesn't leak the region hash entry.

Description:

Keep RH_NOSYNC state unchanged when I/O on the region completes.

rh_dec() changes the region state from RH_NOSYNC (out-of-sync) to RH_CLEAN
(in-sync), which results in the corresponding bit of clean_bits being set.

This is harmful if on-disk log is used and the map is removed/suspended
before the initial sync is completed.  The clean_bits is written down to
the on-disk log at the map removal, and, upon resume, it's read and copied
to sync_bits.  Since the recovery process refers to the sync_bits to find a
region to be recovered, the region whose state was changed from RH_NOSYNC
to RH_CLEAN is no longer recovered.

If you haven't applied dm-raid1-read-balancing.patch proposed in dm-devel
sometimes ago, the contents of the mirrored disk just corrupt silently.  If
you have, balanced read may get bogus data from out-of-sync disks.

The RH_NOSYNC region will be changed to RH_RECOVERING when recovery starts
on the region and get reclaimed when the recovery completes.  So it doesn't
leak the region hash entry.

Alasdair said:

  I've analysed the relevant part of the state machine and I believe that
  the patch is correct.

  (Further work on this code is still needed - this patch has the
  side-effect of holding onto memory unnecessarily for long periods of time
  under certain workloads - but better that than corrupting data.)

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:58 -08:00
Matthew Dobson 0eaae62aba [PATCH] mempool: use common mempool kmalloc allocator
This patch changes several mempool users, all of which are basically just
wrappers around kmalloc(), to use the common mempool_kmalloc/kfree, rather
than their own wrapper function, removing a bunch of duplicated code.

Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:56:59 -08:00
Jonathan E Brassow a1a1908070 [PATCH] device-mapper raid1: add default mirror
This patch introduces a new field to the mirror_set (default_mirror) to store
the default mirror.

(A subsequent patch will allow us to change the default mirror in the event of
a failure.)

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:34:00 -08:00
Jonathan E Brassow 7692c5dd48 [PATCH] device-mapper raid1: drop mark_region spinlock fix
The spinlock region_lock is held while calling mark_region which can sleep.
Drop the spinlock before calling that function.

A region's state and inclusion in the clean list are altered by rh_inc and
rh_dec.  The state variable is set to RH_CLEAN in rh_dec, but only if
'pending' is zero.  It is set to RH_DIRTY in rh_inc, but not if it is already
so.  The changes to 'pending', the state, and the region's inclusion in the
clean list need to be atomicly.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22 09:14:31 -08:00
Al Viro dd0fc66fb3 [PATCH] gfp flags annotations - part 1
- added typedef unsigned int __nocast gfp_t;

 - replaced __nocast uses for gfp flags with gfp_t - it gives exactly
   the same warnings as far as sparse is concerned, doesn't change
   generated code (from gcc point of view we replaced unsigned int with
   typedef) and documents what's going on far better.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-08 15:00:57 -07:00
Jun'ichi Nomura 844e8d904a [PATCH] dm: fix rh_dec()/rh_inc() race in dm-raid1.c
Fix another bug in dm-raid1.c that the dirty region may stay in or be moved
to clean list and freed while in use.

It happens as follows:

   CPU0                                   CPU1
   ------------------------------------------------------------------------------
   rh_dec()
     if (atomic_dec_and_test(pending))
        <the region is still marked dirty>
                                          rh_inc()
                                            if the region is clean
                                               mark the region dirty
                                               and remove from clean list
        mark the region clean
        and move to clean list
                                                  atomic_inc(pending)

At this stage, the region is in clean list and will be mistakenly reclaimed
by rh_update_states() later.

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 16:39:09 -07:00
Alasdair G Kergon 48f1f53282 [PATCH] dm-raid locking fix
This code was never designed to handle more than one instance of do_work()
running at once.

Signed-Off-By: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-04 13:00:55 -07:00
Alasdair G Kergon d88854f089 [PATCH] device-mapper: dm-raid1: Limit bios to size of mirror region
Set the target's split_io field when building a dm-mirror device so
incoming bios won't span the mirror's internal regions.  Without this,
regions can be accessed while not holding correct locks and data corruption
is possible.

Reported-By: "Zhao Qian" <zhaoqian@aaastor.com>
From: Kevin Corry <kevcorry@us.ibm.com>
Signed-Off-By: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07 18:24:11 -07:00
Linus Torvalds 1da177e4c3 Linux-2.6.12-rc2
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!
2005-04-16 15:20:36 -07:00