Commit Graph

4183 Commits

Author SHA1 Message Date
Linus Torvalds 564884fbde Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A set of fixes that wasn't included in the first merge window pull
  request.  This pull request contains:

   - A set of NVMe fixes from Keith, and one from Nic for the integrity
     side of it.

   - Fix from Ming, clearing ->mq_ops if we don't successfully setup a
     queue for multiqueue.

   - A set of stability fixes for bcache from Jiri, and also marking
     bcache as orphaned as it's no longer actively maintained (in
     mainline, at least)"

* 'for-linus' of git://git.kernel.dk/linux-block:
  blk-mq: clear q->mq_ops if init fail
  MAINTAINERS: mark bcache as orphan
  bcache: bch_gc_thread() is not freezable
  bcache: bch_allocator_thread() is not freezable
  bcache: bch_writeback_thread() is not freezable
  nvme/host: Add missing blk_integrity tag_size + flags assignments
  NVMe: Add device ID's with stripe quirk
  NVMe: Short-cut removal on surprise hot-unplug
  NVMe: Allow user initiated rescan
  NVMe: Reduce driver log spamming
  NVMe: Unbind driver on failure
  NVMe: Delete only created queues
  NVMe: Allocate queues only for online cpus
2016-05-27 14:28:09 -07:00
Jiri Kosina 29e6c57cc7 bcache: bch_gc_thread() is not freezable
bch_gc_thread() doesn't mark itself freezable, so calling try_to_freeze()
in its context is just an expensive no-op.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-24 09:00:45 -06:00
Jiri Kosina 770b8ce400 bcache: bch_allocator_thread() is not freezable
bch_allocator_thread() is calling try_to_freeze(), but that's just an
expensive no-op given the fact that the thread is not marked freezable.

Bucket allocator has to be up and running to the very last stages of the
suspend, as the bcache I/O that's in flight (think of writing an
hibernation image to a swap device served by bcache).

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-24 09:00:43 -06:00
Jiri Kosina 7c87df9c15 bcache: bch_writeback_thread() is not freezable
bch_writeback_thread() is calling try_to_freeze(), but that's just an
expensive no-op given the fact that the thread is not marked freezable.

I/O helper kthreads, exactly such as the bcache writeback thread, actually
shouldn't be freezable, because they are potentially necessary for
finalizing the image write-out.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-24 09:00:40 -06:00
Linus Torvalds feaa7cb5c5 Merge tag 'md/4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD updates from Shaohua Li:
 "Several patches from Guoqing fixing md-cluster bugs and several
  patches from Heinz fixing dm-raid bugs"

* tag 'md/4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  md-cluster: check the return value of process_recvd_msg
  md-cluster: gather resync infos and enable recv_thread after bitmap is ready
  md: set MD_CHANGE_PENDING in a atomic region
  md: raid5: add prerequisite to run underneath dm-raid
  md: raid10: add prerequisite to run underneath dm-raid
  md: md.c: fix oops in mddev_suspend for raid0
  md-cluster: fix ifnullfree.cocci warnings
  md-cluster/bitmap: unplug bitmap to sync dirty pages to disk
  md-cluster/bitmap: fix wrong page num in bitmap_file_clear_bit and bitmap_file_set_bit
  md-cluster/bitmap: fix wrong calcuation of offset
  md-cluster: sync bitmap when node received RESYNCING msg
  md-cluster: always setup in-memory bitmap
  md-cluster: wakeup thread if activated a spare disk
  md-cluster: change array_sectors and update size are not supported
  md-cluster: fix locking when node joins cluster during message broadcast
  md-cluster: unregister thread if err happened
  md-cluster: wake up thread to continue recovery
  md-cluser: make resync_finish only called after pers->sync_request
  md-cluster: change resync lock from asynchronous to synchronous
2016-05-19 17:25:13 -07:00
Linus Torvalds b80fed9595 - based on Jens' 'for-4.7/core' to have DM thinp's discard support use
bio_inc_remaining() and the block core's new async
   __blkdev_issue_discard() interface
 
 - make DM multipath's fast code-paths lockless, using lockless_deference,
   to significantly improve large NUMA performance when using blk-mq.  The
   m->lock spinlock contention was a serious bottleneck.
 
 - a few other small code cleanups and Documentation fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJXNdGVAAoJEMUj8QotnQNaYYgH/Rf2am46A78kcR5b9nN2I+Tb
 +MkqQyf8mXUzNHOu3v93CVugT+tBZuJcpHPJgCSc/1GXtgsjHLvbkO2Mc+Ioe45S
 PlUA3HdRzxHSJ365SdYvT+bY+QQlGiySelSBrJHlikXC88kz3wqyQ146BT1Rw/w+
 t0mi1liNJtZHsuH+3uO9uxe5+H7476lB84i79Kz0x8Ygv5+urgaSvDBRO5EH/hkJ
 LN2WJWHDQLT4MtHKCuiMiLpu/1HGvISN2QrMPsFjC1d1DbbZvRWAxYDwGaP/C277
 IflPo7sA/nds5T2vqb0fRTPuxBnzXdFMMvf+VQX7pjCnxlhfaxBkvNtnFpxW+oA=
 =iCyS
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.7-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - based on Jens' 'for-4.7/core' to have DM thinp's discard support use
   bio_inc_remaining() and the block core's new async __blkdev_issue_discard()
   interface

 - make DM multipath's fast code-paths lockless, using lockless_deference,
   to significantly improve large NUMA performance when using blk-mq.
   The m->lock spinlock contention was a serious bottleneck.

 - a few other small code cleanups and Documentation fixes

* tag 'dm-4.7-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm thin: unroll issue_discard() to create longer discard bio chains
  dm thin: use __blkdev_issue_discard for async discard support
  dm thin: remove __bio_inc_remaining() and switch to using bio_inc_remaining()
  dm raid: make sure no feature flags are set in metadata
  dm ioctl: drop use of __GFP_REPEAT in copy_params()'s __vmalloc() call
  dm stats: fix spelling mistake in Documentation
  dm cache: update cache-policies.txt now that mq is an alias for smq
  dm mpath: eliminate use of spinlock in IO fast-paths
  dm mpath: move trigger_event member to the end of 'struct multipath'
  dm mpath: use atomic_t for counting members of 'struct multipath'
  dm mpath: switch to using bitops for state flags
  dm thin: Remove return statement from void function
  dm: remove unused mapped_device argument from free_tio()
2016-05-17 16:13:00 -07:00
Linus Torvalds 24b9f0cf00 Merge branch 'for-4.7/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "On top of the core pull request, this is the drivers pull request for
  this merge window.  This contains:

   - Switch drivers to the new write back cache API, and kill off the
     flush flags.  From me.

   - Kill the discard support for the STEC pci-e flash driver.  It's
     trivially broken, and apparently unmaintained, so it's safer to
     just remove it.  From Jeff Moyer.

   - A set of lightnvm updates from the usual suspects (Matias/Javier,
     and Simon), and fixes from Arnd, Jeff Mahoney, Sagi, and Wenwei
     Tao.

   - A set of updates for NVMe:

        - Turn the controller state management into a proper state
          machine.  From Christoph.

        - Shuffling of code in preparation for NVMe-over-fabrics, also
          from Christoph.

        - Cleanup of the command prep part from Ming Lin.

        - Rewrite of the discard support from Ming Lin.

        - Deadlock fix for namespace removal from Ming Lin.

        - Use the now exported blk-mq tag helper for IO termination.
          From Sagi.

        - Various little fixes from Christoph, Guilherme, Keith, Ming
          Lin, Wang Sheng-Hui.

   - Convert mtip32xx to use the now exported blk-mq tag iter function,
     from Keith"

* 'for-4.7/drivers' of git://git.kernel.dk/linux-block: (74 commits)
  lightnvm: reserved space calculation incorrect
  lightnvm: rename nr_pages to nr_ppas on nvm_rq
  lightnvm: add is_cached entry to struct ppa_addr
  lightnvm: expose gennvm_mark_blk to targets
  lightnvm: remove mgt targets on mgt removal
  lightnvm: pass dma address to hardware rather than pointer
  lightnvm: do not assume sequential lun alloc.
  nvme/lightnvm: Log using the ctrl named device
  lightnvm: rename dma helper functions
  lightnvm: enable metadata to be sent to device
  lightnvm: do not free unused metadata on rrpc
  lightnvm: fix out of bound ppa lun id on bb tbl
  lightnvm: refactor set_bb_tbl for accepting ppa list
  lightnvm: move responsibility for bad blk mgmt to target
  lightnvm: make nvm_set_rqd_ppalist() aware of vblks
  lightnvm: remove struct factory_blks
  lightnvm: refactor device ops->get_bb_tbl()
  lightnvm: introduce nvm_for_each_lun_ppa() macro
  lightnvm: refactor dev->online_target to global nvm_targets
  lightnvm: rename nvm_targets to nvm_tgt_type
  ...
2016-05-17 16:03:32 -07:00
Joe Thornber 202bae5293 dm thin: unroll issue_discard() to create longer discard bio chains
There is little benefit to doing this but it does structure DM thinp's
code to more cleanly use the __blkdev_issue_discard() interface --
particularly in passdown_double_checking_shared_status().

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-05-13 09:04:20 -04:00
Mike Snitzer 3dba53a958 dm thin: use __blkdev_issue_discard for async discard support
With commit 38f2525533 ("block: add __blkdev_issue_discard") DM thinp
no longer needs to carry its own async discard method.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-05-13 09:03:52 -04:00
Mike Snitzer 13e4f8a695 dm thin: remove __bio_inc_remaining() and switch to using bio_inc_remaining()
DM thinp's use of bio_inc_remaining() is critical to ensure the original
parent discard bio isn't completed before sub-discards have.  DM thinp
needs this due to the extra quiescing that occurs, via multiple DM thinp
mappings, while processing large discards.  As such DM thinp must build
the async discard bio chain after some delay -- so bio_inc_remaining()
is used to enable DM thinp to take a reference on the original parent
discard bio for each mapping.  This allows the immediate use of
bio_endio() on that discard bio; but with the understanding that the
actual completion won't occur until each of the sub-discards'
per-mapping references are dropped.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2016-05-13 09:03:52 -04:00
Heinz Mauelshagen 4c9971ca6a dm raid: make sure no feature flags are set in metadata
Given we don't yet support any feature flags in the dm-raid ondisk
metadata (see: 'features' member of 'struct dm_raid_superblock'),
add a check to ensure no flags are actually set, if any features are
set reject the activation of the RAID mapping.

This is to prevent possible data corruption in case of a kernel
downgrade when there'll potentially be feature flags set by a future
dm-raid target.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-05-13 09:03:51 -04:00
Guoqing Jiang 1fa9a1ad0a md-cluster: check the return value of process_recvd_msg
We don't need to run the full path of recv_daemon
if process_recvd_msg doesn't return 0.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-09 09:24:04 -07:00
Guoqing Jiang 51e453aecb md-cluster: gather resync infos and enable recv_thread after bitmap is ready
The in-memory bitmap is not ready when node joins cluster,
so it doesn't make sense to make gather_all_resync_info()
called so earlier, we need to call it after the node's
bitmap is setup. Also, recv_thread could be wake up after
node joins cluster, but it could cause problem if node
receives RESYNCING message without persionality since
mddev->pers->quiesce is called in process_suspend_info.

This commit introduces a new cluster interface load_bitmaps
to fix above problems, load_bitmaps is called in bitmap_load
where bitmap and persionality are ready, and load_bitmaps
does the following tasks:

1. call gather_all_resync_info to load all the node's
   bitmap info.
2. set MD_CLUSTER_ALREADY_IN_CLUSTER bit to recv_thread
   could be wake up, and wake up recv_thread if there is
   pending recv event.

Then ack_bast only wakes up recv_thread after IN_CLUSTER
bit is ready otherwise MD_CLUSTER_PENDING_RESYNC_EVENT is
set.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-09 09:24:03 -07:00
Guoqing Jiang 85ad1d13ee md: set MD_CHANGE_PENDING in a atomic region
Some code waits for a metadata update by:

1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
2. setting MD_CHANGE_PENDING and waking the management thread
3. waiting for MD_CHANGE_PENDING to be cleared

If the first two are done without locking, the code in md_update_sb()
which checks if it needs to repeat might test if an update is needed
before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
in the wait returning early.

So make sure all places that set MD_CHANGE_PENDING are atomicial, and
bit_clear_unless (suggested by Neil) is introduced for the purpose.

Cc: Martin Kepplinger <martink@posteo.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: <linux-kernel@vger.kernel.org>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-09 09:24:02 -07:00
Heinz Mauelshagen fe67d19a2d md: raid5: add prerequisite to run underneath dm-raid
In case md runs underneath the dm-raid target, the mddev does not have
a request queue or gendisk, thus avoid accesses.

This patch adds a missing conditional to the raid5 personality.

Signed-of-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-09 09:24:02 -07:00
Heinz Mauelshagen 859644f0fa md: raid10: add prerequisite to run underneath dm-raid
In case md runs underneath the dm-raid target, the mddev does not have
a request queue or gendisk, thus avoid accesses to it.

This patch adds two missing conditionals to the raid10 personality.

Signed-of-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-09 09:24:01 -07:00
Heinz Mauelshagen 092398dce8 md: md.c: fix oops in mddev_suspend for raid0
Introduced by upstream commit 70d9798b95

The raid0 personality does not create mddev->thread as oposed to
other personalities leading to its unconditional access in
mddev_suspend() causing an oops.

Patch checks for mddev->thread in order to keep the
intention of aforementioned commit.

Fixes: 70d9798b95 ("MD: warn for potential deadlock")
Cc: stable@vger.kernel.org (4.5+)
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-09 09:23:23 -07:00
Michal Hocko 72f6d8d8c9 dm ioctl: drop use of __GFP_REPEAT in copy_params()'s __vmalloc() call
copy_params()'s use of __GFP_REPEAT for the __vmalloc() call doesn't make much
sense because vmalloc doesn't rely on costly high order allocations.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-05-05 15:25:55 -04:00
Mike Snitzer 2da1610ae2 dm mpath: eliminate use of spinlock in IO fast-paths
The primary motivation of this commit is to improve the scalability of
DM multipath on large NUMA systems where m->lock spinlock contention has
been proven to be a serious bottleneck on really fast storage.

The ability to atomically read a pointer, using lockless_dereference(),
is leveraged in this commit.  But all pointer writes are still protected
by the m->lock spinlock (which is fine since these all now occur in the
slow-path).

The following functions no longer require the m->lock spinlock in their
fast-path: multipath_busy(), __multipath_map(), and do_end_io()

And choose_pgpath() is modified to _not_ update m->current_pgpath unless
it also switches the path-group.  This is done to avoid needing to take
the m->lock everytime __multipath_map() calls choose_pgpath().
But m->current_pgpath will be reset if it is failed via fail_path().

Suggested-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-05-05 15:25:52 -04:00
Mike Snitzer 20800cb345 dm mpath: move trigger_event member to the end of 'struct multipath'
Allows the 'work_mutex' member to no longer cross a cacheline.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-05-05 15:25:52 -04:00
Mike Snitzer 91e968aa60 dm mpath: use atomic_t for counting members of 'struct multipath'
The use of atomic_t for nr_valid_paths, pg_init_in_progress and
pg_init_count will allow relaxing the use of the m->lock spinlock.

Suggested-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-05-05 15:25:51 -04:00
Mike Snitzer 518257b132 dm mpath: switch to using bitops for state flags
Mechanical change that doesn't make any real effort to reduce the use of
m->lock; that will come later (once atomics are used for counters, etc).

Suggested-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-05-05 15:25:50 -04:00
Amitoj Kaur Chawla 813923b1a2 dm thin: Remove return statement from void function
Return statement at the end of a void function is useless.

The Coccinelle semantic patch used to make this change is as follows:
//<smpl>
@@
identifier f;
expression e;
@@
void f(...) {
<...
- return
  e;
...>
}
//</smpl>

Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-05-05 15:25:50 -04:00
Mike Snitzer cfae7529b5 dm: remove unused mapped_device argument from free_tio()
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-05-05 15:25:49 -04:00
kbuild test robot bc47e84258 md-cluster: fix ifnullfree.cocci warnings
drivers/md/bitmap.c:2049:6-11: WARNING: NULL check before freeing functions like kfree, debugfs_remove, debugfs_remove_recursive or usb_free_urb is not needed. Maybe consider reorganizing relevant code to avoid passing NULL values.

 NULL check before some freeing functions is not needed.

 Based on checkpatch warning
 "kfree(NULL) is safe this check is probably not required"
 and kfreeaddr.cocci by Julia Lawall.

Generated by: scripts/coccinelle/free/ifnullfree.cocci

Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-04 12:39:35 -07:00
Guoqing Jiang c84400c89f md-cluster/bitmap: unplug bitmap to sync dirty pages to disk
This patch is doing two distinct but related things.

1. It adds bitmap_unplug() for the main bitmap (mddev->bitmap).  As bit
have been set, BITMAP_PAGE_DIRTY is set so bitmap_deamon_work() will
not write those pages out in its regular scans, only bitmap_unplug()
will.  If there are no writes to the array, bitmap_unplug() won't be
called, so we need to call it explicitly here.

2. bitmap_write_all() is a bit of a confusing interface as it doesn't
actually write anything.  The current code for writing "bitmap" works
but this change makes it a bit clearer.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-04 12:39:35 -07:00
Guoqing Jiang 23cea66a37 md-cluster/bitmap: fix wrong page num in bitmap_file_clear_bit and bitmap_file_set_bit
The pnum passed to set_page_attr and test_page_attr should from
0 to storage.file_pages - 1, but bitmap_file_set_bit and
bitmap_file_clear_bit call set_page_attr and test_page_attr with
page->index parameter while page->index has already added node_offset
before.

So we need to minus node_offset in both bitmap_file_clear_bit
and bitmap_file_set_bit.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-04 12:39:35 -07:00
Guoqing Jiang 7f86ffed9b md-cluster/bitmap: fix wrong calcuation of offset
The offset is wrong in bitmap_storage_alloc, we should
set it like below in bitmap_init_from_disk().

node_offset = bitmap->cluster_slot * (DIV_ROUND_UP(store->bytes, PAGE_SIZE));

Because 'offset' is only assigned to 'page->index' and
that is usually over-written by read_sb_page. So it does
not cause problem in general, but it still need to be fixed.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-04 12:39:35 -07:00
Guoqing Jiang 18c9ff7f48 md-cluster: sync bitmap when node received RESYNCING msg
If the node received RESYNCING message which means
another node will perform resync with the area, then
we don't want to do it again in another node.

Let's set RESYNC_MASK and clear NEEDED_MASK for the
region from old-low to new-low which has finished
syncing, and the region from old-hi to new-hi is about
to syncing, bitmap_sync_with_cluste is introduced for
the purpose.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-04 12:39:35 -07:00
Guoqing Jiang c9d6503228 md-cluster: always setup in-memory bitmap
The in-memory bitmap for raid is allocated on demand,
then for cluster scenario, it is possible that slave
node which received RESYNCING message doesn't have the
in-memory bitmap when master node is perform resyncing,
so we can't make bitmap is match up well among each
nodes.

So for cluster scenario, we need always preserve the
bitmap, and ensure the page will not be freed. And a
no_hijack flag is introduced to both bitmap_checkpage
and bitmap_get_counter, which makes cluster raid returns
fail once allocate failed.

And the next patch is relied on this change since it
keeps sync bitmap among each nodes during resyncing
stage.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-04 12:39:35 -07:00
Guoqing Jiang a578183ed9 md-cluster: wakeup thread if activated a spare disk
When a device is re-added, it will ultimately need
to be activated and that happens in md_check_recovery,
so we need to set MD_RECOVERY_NEEDED right after
remove_and_add_spares.

A specifical issue without the change is that when
one node perform fail/remove/readd on a disk, but
slave nodes could not add the disk back to array as
expected (added as missed instead of in sync). So
give slave nodes a chance to do resync.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-04 12:39:35 -07:00
Guoqing Jiang ab5a98b132 md-cluster: change array_sectors and update size are not supported
Currently, some features are not supported yet,
such as change array_sectors and update size, so
return EINVAL for them and listed it in document.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-04 12:39:35 -07:00
Guoqing Jiang 1535212c54 md-cluster: fix locking when node joins cluster during message broadcast
If a node joins the cluster while a message broadcast
is under way, a lock issue could happen as follows.

For a cluster which included two nodes, if node A is
calling __sendmsg before up-convert CR to EX on ack,
and node B released CR on ack. But if a new node C
joins the cluster and it doesn't receive the message
which A sent before, so it could hold CR on ack before
A up-convert CR to EX on ack.

So a node joining the cluster should get an EX lock on
the "token" first to ensure no broadcast is ongoing,
then release it after held CR on ack.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-04 12:39:35 -07:00
Guoqing Jiang 5b0fb33e8a md-cluster: unregister thread if err happened
The two threads need to be unregistered if a node
can't join cluster successfully.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-04 12:39:35 -07:00
Guoqing Jiang eb315cd093 md-cluster: wake up thread to continue recovery
In recovery case, we need to set MD_RECOVERY_NEEDED
and wake up thread only if recover is not finished.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-04 12:39:35 -07:00
Guoqing Jiang 2c97cf1385 md-cluser: make resync_finish only called after pers->sync_request
It is not reasonable that cluster raid to release resync
lock before the last pers->sync_request has finished.

As the metadata will be changed when node performs resync,
we need to inform other nodes to update metadata, so the
MD_CHANGE_PENDING flag is set before finish resync.

Then metadata_update_finish is move ahead to ensure that
METADATA_UPDATED msg is sent before finish resync, and
metadata_update_start need to be run after "repeat:" label
accordingly.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-04 12:39:35 -07:00
Guoqing Jiang 41a9a0dcf8 md-cluster: change resync lock from asynchronous to synchronous
If multiple nodes choose to attempt do resync at the same time
they need to be serialized so they don't duplicate effort. This
serialization is done by locking the 'resync' DLM lock.

Currently if a node cannot get the lock immediately it doesn't
request notification when the lock becomes available (i.e.
DLM_LKF_NOQUEUE is set), so it may not reliably find out when it
is safe to try again.

Rather than trying to arrange an async wake-up when the lock
becomes available, switch to using synchronous locking - this is
a lot easier to think about.  As it is not permitted to block in
the 'raid1d' thread, move the locking to the resync thread.  So
the rsync thread is forked immediately, but it blocks until the
resync lock is available. Once the lock is locked it checks again
if any resync action is needed.

A particular symptom of the current problem is that a node can
get stuck with "resync=pending" indefinitely.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-04 12:39:35 -07:00
Linus Torvalds 98bcf28636 Merge tag 'md/4.6-rc6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD fixes from Shaohua Li:
 "This update includes several trival fixes.  The only important one is
  to fix MD bio merge, which has big performance impact"

* tag 'md/4.6-rc6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  raid5: delete unnecessary warnning
  MD: make bio mergeable
  md/raid0: remove empty line printk from dump_zones
  md/raid0: fix uninitialized variable bug
2016-05-02 12:22:51 -07:00
Shaohua Li b8a0b8e946 raid5: delete unnecessary warnning
If device has R5_LOCKED set, it's legit device has R5_SkipCopy set and page !=
orig_page. After R5_LOCKED is clear, handle_stripe_clean_event will clear the
SkipCopy flag and set page to orig_page. So the warning is unnecessary.

Reported-by: Joey Liao <joeyliao@qnap.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-04-29 14:18:03 -07:00
Shaohua Li 9c573de328 MD: make bio mergeable
blk_queue_split marks bio unmergeable, which makes sense for normal bio.
But if dispatching the bio to underlayer disk, the blk_queue_split
checks are invalid, hence it's possible the bio becomes mergeable.

In the reported bug, this bug causes trim against raid0 performance slash
https://bugzilla.kernel.org/show_bug.cgi?id=117051

Reported-and-tested-by: Park Ju Hyung <qkrwngud825@gmail.com>
Fixes: 6ac45aeb6bca(block: avoid to merge splitted bio)
Cc: stable@vger.kernel.org (v4.3+)
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Neil Brown <neilb@suse.de>
Reviewed-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-04-25 18:21:33 -07:00
Michał Pecio b297874a2d md/raid0: remove empty line printk from dump_zones
Remove the final printk. All preceding output is already properly
newline-terminated and the printk isn't even KERN_CONT to begin with,
so it only adds one empty line to the log.

Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-04-25 08:43:58 -07:00
Ahmed Samy 6545b60baa dm cache metadata: fix cmd_read_lock() acquiring write lock
Commit 9567366fef ("dm cache metadata: fix READ_LOCK macros and
cleanup WRITE_LOCK macros") uses down_write() instead of down_read() in
cmd_read_lock(), yet up_read() is used to release the lock in
READ_UNLOCK().  Fix it.

Fixes: 9567366fef ("dm cache metadata: fix READ_LOCK macros and cleanup WRITE_LOCK macros")
Cc: stable@vger.kernel.org
Signed-off-by: Ahmed Samy <f.fallen45@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-04-17 11:24:46 -04:00
Mike Snitzer 9567366fef dm cache metadata: fix READ_LOCK macros and cleanup WRITE_LOCK macros
The READ_LOCK macro was incorrectly returning -EINVAL if
dm_bm_is_read_only() was true -- it will always be true once the cache
metadata transitions to read-only by dm_cache_metadata_set_read_only().

Wrap READ_LOCK and WRITE_LOCK multi-statement macros in do {} while(0).
Also, all accesses of the 'cmd' argument passed to these related macros
are now encapsulated in parenthesis.

A follow-up patch can be developed to eliminate the use of macros in
favor of pure C code.  Avoiding that now given that this needs to apply
to stable@.

Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fixes: d14fcf3dd7 ("dm cache: make sure every metadata function checks fail_io")
Cc: stable@vger.kernel.org
2016-04-14 17:34:49 -04:00
Dan Carpenter 7dedd15dd2 md/raid0: fix uninitialized variable bug
If this function fails the callers expect that *private_conf is set to
an ERR_PTR() but that isn't true for the first error path where we can't
allocate "conf".  It leads to some uninitialized variable bugs.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-04-14 09:57:59 -07:00
Jens Axboe c888a8f95a block: kill off q->flush_flags
Now that we converted everything to the newer block write cache
interface, kill off the queue flush_flags and queueable flush
entries.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-13 13:33:19 -06:00
Jens Axboe 56883a7ec8 md: update to using blk_queue_write_cache()
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-04-12 16:00:39 -06:00
Jens Axboe 519a7e16f9 dm: switch to using blk_queue_write_cache()
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-04-12 16:00:39 -06:00
Jens Axboe 84b4ff9ef2 bcache: switch to using blk_queue_write_cache()
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-04-12 16:00:39 -06:00
Mikulas Patocka 072623de1f dm: fix dm_target_io leak if clone_bio() returns an error
Commit c80914e81e ("dm: return error if bio_integrity_clone() fails
in clone_bio()") changed clone_bio() such that if it does return error
then the alloc_tio() created resources (both the bio that was allocated
to be a clone and the containing dm_target_io struct) will leak.

Fix this by calling free_tio() in __clone_and_map_data_bio()'s
clone_bio() error path.

Fixes: c80914e81e ("dm: return error if bio_integrity_clone() fails in clone_bio()")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-04-11 11:49:09 -04:00
Linus Torvalds 63b106a87d Merge tag 'md/4.6-rc2-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD fixes from Shaohua Li:
 "This update mainly fixes bugs:

   - fix error handling (Guoqing)
   - fix a crash when a disk is hotremoved (me)
   - fix a dead loop (Wei Fang)"

* tag 'md/4.6-rc2-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  md/bitmap: clear bitmap if bitmap_create failed
  MD: add rdev reference for super write
  md: fix a trivial typo in comments
  md:raid1: fix a dead loop when read from a WriteMostly disk
2016-04-09 11:23:27 -07:00