raid5 cache could write bitmap superblock before bitmap superblock is
initialized. The bitmap superblock is less than 512B. The current code will
only copy the superblock to a new page and write the whole 512B, which will
zero the the data after the superblock. Unfortunately the data could include
bitmap, which we should preserve. The patch will make superblock read do 4k
chunk and we always copy the 4k data to new page, so the superblock write will
old data to disk and we don't change the bitmap.
Reported-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Song Liu <songliubraving@fb.com>
Cc: stable@vger.kernel.org (4.10+)
Signed-off-by: Shaohua Li <shli@fb.com>
The device owns Bitmap_sync flag needs recovery
to become in sync, and read page from this type
device could get stale status.
Also add comments for Bitmap_sync bit per the
suggestion from Shaohua and Neil.
Previous disscussion can be found here:
https://marc.info/?t=149760428900004&r=1&w=2
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Previously, the uuid debug statements were printed in little-endian
format, which wasn't consistent in machines that might not be in
little-endian byte order. With this change, the output will be
consistent for all machines with different byte-ordering.
Signed-off-by: Kyungchan Koh <kkc6196@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Since we have switched to sync way to handle METADATA_UPDATED
msg for md-cluster, then process_metadata_update is depended
on mddev->thread->wqueue.
With the new change, clustered raid could possible hang if
array received a METADATA_UPDATED msg after array unregistered
mddev->thread, so we need to stop clustered raid (bitmap_destroy
-> bitmap_free -> md_cluster_stop) earlier than unregister
thread (mddev_detach -> md_unregister_thread).
And this change should be safe for non-clustered raid since
all writes are stopped before the destroy. Also in md_run,
we activate the personality (pers->run()) before activating
the bitmap (bitmap_create()). So it is pleasingly symmetric
to stop the bitmap (bitmap_destroy()) before stopping the
personality (__md_stop() calls pers->free()), we achieve this
by move bitmap_destroy to the beginning of __md_stop.
But we don't want to break the codes for waiting behind IO as
Shaohua mentioned, so introduce bitmap_wait_behind_writes to
call the codes, and call the new fun in both mddev_detach and
bitmap_destroy, then we will not break original behind IO code
and also fit the new condition well.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Support resize is a little complex for clustered
raid, since we need to ensure all the nodes share
the same knowledge about the size of raid.
We achieve the goal by check the sync_size which
is in each node's bitmap, we can only change the
capacity after cluster_check_sync_size returns 0.
Also, get_bitmap_from_slot is added to get a slot's
bitmap. And we exported some funcs since they are
used in cluster_check_sync_size().
We can also reuse get_bitmap_from_slot to remove
redundant code existed in bitmap_copy_from_slot.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
The mddev->flags are used for different purposes. There are a lot of
places we check/change the flags without masking unrelated flags, we
could check/change unrelated flags. These usage are most for superblock
write, so spearate superblock related flags. This should make the code
clearer and also fix real bugs.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
This can only be supported on personalities which ensure
that md_error() never causes an array to enter the 'failed'
state. i.e. if marking a device Faulty would cause some
data to be inaccessible, the device is status is left as
non-Faulty. This is true for RAID1 and RAID10.
If we get a failure writing metadata but the device doesn't
fail, it must be the last device so we re-write without
FAILFAST to improve chance of success. We also flag the
device as LastDev so that future metadata updates don't
waste time on failfast writes.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
We trace wheneven bitmap_unplug() finds that it needs to write
to the bitmap, or when bitmap_daemon_work() find there is work
to do.
This makes it easier to correlate bitmap updates with data writes.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
As we don't wait for writes to complete in bitmap_daemon_work, they
could still be in-flight when bitmap_unplug writes again. Or when
bitmap_daemon_work tries to write again.
This can be confusing and could risk the wrong data being written last.
So make sure we wait for old writes to complete before new writes start.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Follow err/warn distinction introduced in md.c
Join multi-part strings into single string.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
It is possible that bitmap_storage_alloc could return -ENOMEM,
and some member inside store could be allocated such as filemap.
To avoid memory leak, we need to call bitmap_file_unmap to free
those members in the bitmap_resize.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
if bitmap_create fails, the bitmap is already cleaned up and the returned value
is an error number. We can't do the cleanup again.
Reported-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Shaohua Li <shli@fb.com>
Changing the location changes a lot of things. Holding the lock to avoid race.
This makes the .quiesce called with mddev lock hold too.
Acked-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Separate the op from the rq_flag_bits and have md
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This has submit_bh users pass in the operation and flags separately,
so submit_bh_wbc can setup the bio op and bi_rw flags on the bio that
is submitted.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The in-memory bitmap is not ready when node joins cluster,
so it doesn't make sense to make gather_all_resync_info()
called so earlier, we need to call it after the node's
bitmap is setup. Also, recv_thread could be wake up after
node joins cluster, but it could cause problem if node
receives RESYNCING message without persionality since
mddev->pers->quiesce is called in process_suspend_info.
This commit introduces a new cluster interface load_bitmaps
to fix above problems, load_bitmaps is called in bitmap_load
where bitmap and persionality are ready, and load_bitmaps
does the following tasks:
1. call gather_all_resync_info to load all the node's
bitmap info.
2. set MD_CLUSTER_ALREADY_IN_CLUSTER bit to recv_thread
could be wake up, and wake up recv_thread if there is
pending recv event.
Then ack_bast only wakes up recv_thread after IN_CLUSTER
bit is ready otherwise MD_CLUSTER_PENDING_RESYNC_EVENT is
set.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
drivers/md/bitmap.c:2049:6-11: WARNING: NULL check before freeing functions like kfree, debugfs_remove, debugfs_remove_recursive or usb_free_urb is not needed. Maybe consider reorganizing relevant code to avoid passing NULL values.
NULL check before some freeing functions is not needed.
Based on checkpatch warning
"kfree(NULL) is safe this check is probably not required"
and kfreeaddr.cocci by Julia Lawall.
Generated by: scripts/coccinelle/free/ifnullfree.cocci
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
This patch is doing two distinct but related things.
1. It adds bitmap_unplug() for the main bitmap (mddev->bitmap). As bit
have been set, BITMAP_PAGE_DIRTY is set so bitmap_deamon_work() will
not write those pages out in its regular scans, only bitmap_unplug()
will. If there are no writes to the array, bitmap_unplug() won't be
called, so we need to call it explicitly here.
2. bitmap_write_all() is a bit of a confusing interface as it doesn't
actually write anything. The current code for writing "bitmap" works
but this change makes it a bit clearer.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
The pnum passed to set_page_attr and test_page_attr should from
0 to storage.file_pages - 1, but bitmap_file_set_bit and
bitmap_file_clear_bit call set_page_attr and test_page_attr with
page->index parameter while page->index has already added node_offset
before.
So we need to minus node_offset in both bitmap_file_clear_bit
and bitmap_file_set_bit.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
The offset is wrong in bitmap_storage_alloc, we should
set it like below in bitmap_init_from_disk().
node_offset = bitmap->cluster_slot * (DIV_ROUND_UP(store->bytes, PAGE_SIZE));
Because 'offset' is only assigned to 'page->index' and
that is usually over-written by read_sb_page. So it does
not cause problem in general, but it still need to be fixed.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
If the node received RESYNCING message which means
another node will perform resync with the area, then
we don't want to do it again in another node.
Let's set RESYNC_MASK and clear NEEDED_MASK for the
region from old-low to new-low which has finished
syncing, and the region from old-hi to new-hi is about
to syncing, bitmap_sync_with_cluste is introduced for
the purpose.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
The in-memory bitmap for raid is allocated on demand,
then for cluster scenario, it is possible that slave
node which received RESYNCING message doesn't have the
in-memory bitmap when master node is perform resyncing,
so we can't make bitmap is match up well among each
nodes.
So for cluster scenario, we need always preserve the
bitmap, and ensure the page will not be freed. And a
no_hijack flag is introduced to both bitmap_checkpage
and bitmap_get_counter, which makes cluster raid returns
fail once allocate failed.
And the next patch is relied on this change since it
keeps sync bitmap among each nodes during resyncing
stage.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Pull MD fixes from Shaohua Li:
"This update mainly fixes bugs:
- fix error handling (Guoqing)
- fix a crash when a disk is hotremoved (me)
- fix a dead loop (Wei Fang)"
* tag 'md/4.6-rc2-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
md/bitmap: clear bitmap if bitmap_create failed
MD: add rdev reference for super write
md: fix a trivial typo in comments
md:raid1: fix a dead loop when read from a WriteMostly disk
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If bitmap_create returns an error, we need to call
either bitmap_destroy or bitmap_free to do clean up,
and the selection is based on mddev->bitmap is set
or not.
And the sysfs_put(bitmap->sysfs_can_clear) is moved
from bitmap_destroy to bitmap_free, and the comment
of bitmap_create is changed as well.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
The "return 0" is not needed since bitmap_checkpage
will finally return 0 for the case.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
daemon_sleep is an unsigned, so testing if it's 0 or less than 1 does
the same thing.
Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Suspending the entire device for resync could take too long. Resync
in small chunks.
cluster's resync window (32M) is maintained in r1conf as
cluster_sync_low and cluster_sync_high and processed in
raid1's sync_request(). If the current resync is outside the cluster
resync window:
1. Set the cluster_sync_low to curr_resync_completed.
2. Check if the sync will fit in the new window, if not issue a
wait_barrier() and set cluster_sync_low to sector_nr.
3. Set cluster_sync_high to cluster_sync_low + resync_window.
4. Send a message to all nodes so they may add it in their suspension
list.
bitmap_cond_end_sync is modified to allow to force a sync inorder
to get the curr_resync_completed uptodate with the sector passed.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Add BITMAP_MAJOR_CLUSTERED as 5, in order to prevent older kernels
to assemble a clustered device.
In order to maximize compatibility, the major version is set to
BITMAP_MAJOR_CLUSTERED *only* if the bitmap is clustered.
Added MD_FEATURE_CLUSTERED in order to return error for older
kernels which would assemble MD even if the bitmap is corrupted.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Passing -1 to bitmap_storage_alloc() causes page->index to be set to
-1, which is quite problematic.
So only pass ->cluster_slot if mddev_is_clustered().
Fixes: b97e92574c ("Use separate bitmaps for each nodes in the cluster")
Cc: stable@vger.kernel.org (v4.1+)
Signed-off-by: NeilBrown <neilb@suse.com>
Several are tagged for -stable.
A few aren't because they are not very, serious or
because they are in the 'experimental' cluster code.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJVsbRgAAoJEDnsnt1WYoG54BcQAJlVxcGdMXNAbFAfhToH+cwY
DcCKmnXiu3TCcK/gtr0SlUKIhv7kPE2HaGbTko3g5/uTk7SCuMSWRbjpQJr1u98U
9VUPZV0RGLEcQXgsjG3sobEtdSYMSn1/BpdJpmsn/Q6cyTheiEJA9fEghn9F0Iw9
3Ctc56aL0nKsnpeRTvWmKR3F995L2Ene+pIHPbSqTlQcGU+DdxQL/iY9YspMzeih
6SWQICo59+pGSRQdKaU968nZ1a8hJCvhV2NbW0JsJMo9iGo5htCJLdxZatscZV6E
xAppCRymW/Nl0n59wCOBhUp+0skVhtYZ1UmNdx/vUA7vcZJX85vIG7WGaaSQAmlF
Y4nNMSaX6SWbt/oVgq+JiD39T1oFZN5N5Fh9juqcp3fshiWLO74cnTwea1LtkQZX
LfW2HhajDUgoJqoXL6ppKDK8l7alQdcaoYn0C6SfqRZ+PLB5ERrSBHNZpKDbI9aw
CFW93lHTtlwtwZn0S7k06mCG8ilO4iPthUVaJEeI1kUWVKY0Ju4xnVTt/7jCroUa
Km14KdWeZsS8j9/xr6FYBjarjuh7M9SoflgUK84txE+PIZsSczyrErpBkMf2YBWQ
0KduqQabXa1XYWGtZ7Uhj6iOOGNBcTcZsww8cBEDOJEDyCtSs5w/iPmsFCEGUuo2
YTz6qKpYPgB1VzK2PoBG
=6ZCK
-----END PGP SIGNATURE-----
Merge tag 'md/4.2-fixes' of git://neil.brown.name/md
Pull md fixes from Neil Brown:
"Some md fixes for 4.2
Several are tagged for -stable.
A few aren't because they are not very, serious or because they are in
the 'experimental' cluster code"
* tag 'md/4.2-fixes' of git://neil.brown.name/md:
md/raid5: clear R5_NeedReplace when no longer needed.
Fix read-balancing during node failure
md-cluster: fix bitmap sub-offset in bitmap_read_sb
md: Return error if request_module fails and returns positive value
md: Skip cluster setup in case of error while reading bitmap
md/raid1: fix test for 'was read error from last working device'.
md: Skip cluster setup for dm-raid
md: flush ->event_work before stopping array.
md/raid10: always set reshape_safe when initializing reshape_position.
md/raid5: avoid races when changing cache size.
bitmap_read_sb is modifying mddev->bitmap_info.offset. This works for
the first bitmap read. However, when multiple bitmaps need to be opened
by the same node, it ends up corrupting the offset. Fix it by using a
local variable.
Also, bitmap_read_sb is not required in bitmap_copy_from_slot since
it is called in bitmap_create. Remove bitmap_read_sb().
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
If the bitmap read fails, the error code set is -EINVAL. However,
we don't check for errors and go ahead with cluster_setup.
Skip the cluster setup in case of error.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
There is a bug that the bitmap superblock isn't initialised properly for
dm-raid, so a new field can have garbage in new fields.
(dm-raid does initialisation in the kernel - md initialised the
superblock in mdadm).
This means that for dm-raid we cannot currently trust the new ->nodes
field. So:
- use __GFP_ZERO to initialise the superblock properly for all new
arrays
- initialise all fields in bitmap_info in bitmap_new_disk_sb
- ignore ->nodes for dm arrays (yes, this is a hack)
This bug exposes dm-raid to bug in the (still experimental) md-cluster
code, so it is suitable for -stable. It does cause crashes.
References: https://bugzilla.kernel.org/show_bug.cgi?id=100491
Cc: stable@vger.kernel.org (v4.1)
Signed-off-By: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Pull more vfs updates from Al Viro:
"Assorted VFS fixes and related cleanups (IMO the most interesting in
that part are f_path-related things and Eric's descriptor-related
stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
fs-cache series, DAX patches, Jan's file_remove_suid() work"
[ I'd say this is much more than "fixes and related cleanups". The
file_table locking rule change by Eric Dumazet is a rather big and
fundamental update even if the patch isn't huge. - Linus ]
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
9p: cope with bogus responses from server in p9_client_{read,write}
p9_client_write(): avoid double p9_free_req()
9p: forgetting to cancel request on interrupted zero-copy RPC
dax: bdev_direct_access() may sleep
block: Add support for DAX reads/writes to block devices
dax: Use copy_from_iter_nocache
dax: Add block size note to documentation
fs/file.c: __fget() and dup2() atomicity rules
fs/file.c: don't acquire files->file_lock in fd_install()
fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
vfs: avoid creation of inode number 0 in get_next_ino
namei: make set_root_rcu() return void
make simple_positive() public
ufs: use dir_pages instead of ufs_dir_pages()
pagemap.h: move dir_pages() over there
remove the pointless include of lglock.h
fs: cleanup slight list_entry abuse
xfs: Correctly lock inode when removing suid and file capabilities
fs: Call security_ops->inode_killpriv on truncate
fs: Provide function telling whether file_remove_privs() will do anything
...
Evaluating "&mddev->disks" is simple pointer arithmetic, so
it does not need 'rcu' annotations - no dereferencing is happening.
Also enhance the comment to explain that 'rdev' in that case
is not actually a pointer to an rdev.
Reported-by: Patrick Marlier <patrick.marlier@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When "re-add" is writted to /sys/block/mdXX/md/dev-YYY/state,
the clustered md:
1. Sends RE_ADD message with the desc_nr. Nodes receiving the message
clear the Faulty bit in their respective rdev->flags.
2. The node initiating re-add, gathers the bitmaps of all nodes
and copies them into the local bitmap. It does not clear the bitmap
from which it is copying.
3. Initiating node schedules a md recovery to sync the devices.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The calculations of bitmap offset is incorrect with respect to bits to bytes
conversion.
Also, remove an irrelevant duplicate message.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
neilb: modified to not corrupt ->resync_max_sectors.
sector_div usage fixed by Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NeilBrown <neilb@suse.de>
bitmap_copy_from_slot reads the bitmap from the slot mentioned.
It then copies the set bits to the node local bitmap.
This is helper function for the resync operation on node failure.
bitmap_set_memory_bits() currently assumes it is only run at startup and that
they bitmap is currently empty. So if it finds that a region is already
marked as dirty, it won't mark it dirty again. Change bitmap_set_memory_bits()
to always set the NEEDED_MASK bit if 'needed' is set.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
On-disk format:
0 4k 8k 12k
-------------------------------------------------------------------
| idle | md super | bm super [0] + bits |
| bm bits[0, contd] | bm super[1] + bits | bm bits[1, contd] |
| bm super[2] + bits | bm bits [2, contd] | bm super[3] + bits |
| bm bits [3, contd] | | |
Bitmap super has a field nodes, which defines the maximum number
of nodes the device can use. While reading the bitmap super, if
the cluster finds out that the number of nodes is > 0:
1. Requests the md-cluster module.
2. Calls md_cluster_ops->join(), which sets up clustering such as
joining DLM lockspace.
Since the first time, the first bitmap is read. After the call
to the cluster_setup, the bitmap offset is adjusted and the
superblock is re-read. This also ensures the bitmap is read
the bitmap lock (when bitmap lock is introduced in later patches)
Questions:
1. cluster name is repeated in all bitmap supers. Is that okay?
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
DLM offers callbacks when a node fails and the lock remastery
is performed:
1. recover_prep: called when DLM discovers a node is down
2. recover_slot: called when DLM identifies the node and recovery
can start
3. recover_done: called when all nodes have completed recover_slot
recover_slot() and recover_done() are also called when the node joins
initially in order to inform the node with its slot number. These slot
numbers start from one, so we deduct one to make it start with zero
which the cluster-md code uses.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
md_cluster_info stores the cluster information in the MD device.
The join() is called when mddev detects it is a clustered device.
The main responsibilities are:
1. Setup a DLM lockspace
2. Setup all initial locks such as super block locks and bitmap lock (will come later)
The leave() clears up the lockspace and all the locks held.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Most attributes can be read safely without any locking.
A race might lead to a slightly out-dated value, but nothing wrong.
We already have locking in some places where needed.
All that remains is can_clear_show(), behind_writes_used_show()
and action_show() which are easily fixed.
Signed-off-by: NeilBrown <neilb@suse.de>