If the size of the underlying device changes, change the size of the
integrity device too.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Move code to a new function get_provided_data_sectors().
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Following commits will make it possible to shrink or extend the device. If
the device was shrunk, we don't want to replay journal data pointing past
the end of the device.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Since the commit 72deb455b5 ("block:
remove CONFIG_LBDAF") sector_t is always defined as unsigned long
long.
Delete the needless type casts in printk and avoids some warnings if
DEBUG_PRINT is defined.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If the user specifies tag size larger than HASH_MAX_DIGESTSIZE,
there's a crash in integrity_metadata().
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
zmd->nr_rnd_zones was increased twice by mistake. The other place it
is increased in dmz_init_zone() is the only one needed:
1131 zmd->nr_useable_zones++;
1132 if (dmz_is_rnd(zone)) {
1133 zmd->nr_rnd_zones++;
^^^
Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If we write a superblock in writecache_flush, we don't need to set bit and
scan the bitmap for it - we can just write the superblock directly. Also,
we can set the flag REQ_FUA on the write bio, so that we don't need to
submit a flush bio afterwards.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If a block is stored in the cache for too long, it will now be
written to the underlying device and cleaned up.
Add a new option "max_age" that specifies the maximum age of a block
in milliseconds.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The "flush" or "flush_on_suspend" messages flush the whole cache. However,
these flushing methods can take some time and the process is left in
an interruptible state during the flush.
Implement a "cleaner" option that offers an alternate flushing method.
When this option is activated (either by a message or in the constructor
arguments), the cache will not promote new writes (however, writes to
already cached blocks are promoted, to avoid data corruption due to
misordered writes) and it will gradually writeback any cached data. The
userspace can then monitor the cleaning process with "dmsetup status".
When the number of cached bloks drops to zero, the userspace can unload
the dm-writecache target and replace it with dm-linear or other targets.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If the cache device is full, we do a direct write to the origin device.
Note that we must not do it if the written block is already in the cache.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Similar to f710126cfc ("dm crypt: print
device name in integrity error message"), this message should also
better identify the device with the integrity failure.
Signed-off-by: Erich Eckner <git@eckner.net>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Replace test_bit(CRYPT_MODE_INTEGRITY_AEAD, XXX) with
crypt_integrity_aead().
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add a new include/linux/raid/detect.h header to declare the
md_autodetect_dev prototype which can be shared between md and
the partition code. Then use IS_BUILTIN to call it instead of the
ifdef magic.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no good reason for __bdevname to exist. Just open code
printing the string in the callers. For three of them the format
string can be trivially merged into existing printk statements,
and in init/do_mounts.c we can at least do the scnprintf once at
the start of the function, and unconditional of CONFIG_BLOCK to
make the output for tiny configfs a little more helpful.
Acked-by: Theodore Ts'o <tytso@mit.edu> # for ext4
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The idea of this patch is from Davidlohr Bueso, he posts a patch
for bcache to optimize barrier usage for read-modify-write atomic
bitops. Indeed such optimization can also apply on other locations
where smp_mb() is used before or after an atomic operation.
This patch replaces smp_mb() with smp_mb__before_atomic() or
smp_mb__after_atomic() in btree.c and writeback.c, where it is used
to synchronize memory cache just earlier on other cores. Although
the locations are not on hot code path, it is always not bad to mkae
things a little better.
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We can avoid the unnecessary barrier on non LL/SC architectures,
such as x86. Instead, use the smp_mb__after_atomic().
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since snprintf() returns the would-be-output size instead of the
actual output size, the succeeding calls may go beyond the given
buffer limit. Fix it by replacing with scnprintf().
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When attaching a cached device (a.k.a backing device) to a cache
device, bch_sectors_dirty_init() is called to count dirty sectors
and stripes (see what bcache_dev_sectors_dirty_add() does) on the
cache device.
The counting is done by a single thread recursive function
bch_btree_map_keys() to iterate all the bcache btree nodes.
If the btree has huge number of nodes, bch_sectors_dirty_init() will
take quite long time. In my testing, if the registering cache set has
a existed UUID which matches a already registered cached device, the
automatical attachment during the registration may take more than
55 minutes. This is too long for waiting the bcache to work in real
deployment.
Fortunately when bch_sectors_dirty_init() is called, no other thread
will access the btree yet, it is safe to do a read-only parallelized
dirty sectors counting by multiple threads.
This patch tries to create multiple threads, and each thread tries to
one-by-one count dirty sectors from the sub-tree indexed by a root
node key which the thread fetched. After the sub-tree is counted, the
counting thread will continue to fetch another root node key, until
the fetched key is NULL. How many threads in parallel depends on
the number of keys from the btree root node, and the number of online
CPU core. The thread number will be the less number but no more than
BCH_DIRTY_INIT_THRD_MAX. If there are only 2 keys in root node, it
can only be 2x times faster by this patch. But if there are 10 keys
in the root node, with this patch it can be 10x times faster.
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When registering a cache device, bch_btree_check() is called to check
all btree nodes, to make sure the btree is consistent and not
corrupted.
bch_btree_check() is recursively executed in a single thread, when there
are a lot of data cached and the btree is huge, it may take very long
time to check all the btree nodes. In my testing, I observed it took
around 50 minutes to finish bch_btree_check().
When checking the bcache btree nodes, the cache set is not running yet,
and indeed the whole tree is in read-only state, it is safe to create
multiple threads to check the btree in parallel.
This patch tries to create multiple threads, and each thread tries to
one-by-one check the sub-tree indexed by a key from the btree root node.
The parallel thread number depends on how many keys in the btree root
node. At most BCH_BTR_CHKTHREAD_MAX (64) threads can be created, but in
practice is should be min(cpu-number/2, root-node-keys-number).
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch changes macro btree_root() and btree() to bcache_btree_root()
and bcache_btree(), to avoid potential generic name clash in future.
NOTE: for product kernel maintainers, this patch can be skipped if
you feel the rename stuffs introduce inconvenince to patch backport.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In order to accelerate bcache registration speed, the macro btree()
and btree_root() will be referenced out of btree.c. This patch moves
them from btree.c into btree.h with other relative function declaration
in btree.h, for the following changes.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't call quiesce(1) and quiesce(0) if array is already suspended,
otherwise in level_store, the array is writable after mddev_detach
in below part though the intention is to make array writable after
resume.
mddev_suspend(mddev);
mddev_detach(mddev);
...
mddev_resume(mddev);
And it also causes calltrace as follows in [1].
[48005.653834] WARNING: CPU: 1 PID: 45380 at kernel/kthread.c:510 kthread_park+0x77/0x90
[...]
[48005.653976] CPU: 1 PID: 45380 Comm: mdadm Tainted: G OE 5.4.10-arch1-1 #1
[48005.653979] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J4105-ITX, BIOS P1.40 08/06/2018
[48005.653984] RIP: 0010:kthread_park+0x77/0x90
[48005.654015] Call Trace:
[48005.654039] r5l_quiesce+0x3c/0x70 [raid456]
[48005.654052] raid5_quiesce+0x228/0x2e0 [raid456]
[48005.654073] mddev_detach+0x30/0x70 [md_mod]
[48005.654090] level_store+0x202/0x670 [md_mod]
[48005.654099] ? security_capable+0x40/0x60
[48005.654114] md_attr_store+0x7b/0xc0 [md_mod]
[48005.654123] kernfs_fop_write+0xce/0x1b0
[48005.654132] vfs_write+0xb6/0x1a0
[48005.654138] ksys_write+0x67/0xe0
[48005.654146] do_syscall_64+0x4e/0x140
[48005.654155] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[48005.654161] RIP: 0033:0x7fa0c8737497
[1]: https://bugzilla.kernel.org/show_bug.cgi?id=206161
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5j8hwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpnjID/4/XVrqtVNUzVoVOtkOyxyesBrJVMHEQEpJ
PZssv835IStw0ENhxQJfGjPaIFc9Ff6PMkeN5KRAlMoEc+NkrJShF3owGf+6Bps7
rxpblPxaw+CJFa31YBDZVjMCvbVkDm40G5SsJh+xzdIjlWz7MppkkMPdrErPwY8V
0vnrIc+mKBKfBMZTwVkycYtp17LVgfXguledoWzxM1y47IW5UasKh8jdzhbu8Hvt
zztdQrigUdb+9XnLGCZIY0JQOyrhJ5zQpZ40FzbvxdYrQZXOoYT8L7iFu/z0Wi7K
p3a+G+B4WowtLYW78me4Uut5RrHq2XOehSypfujanQlpgXPGjS3TdHT3an2T8XPQ
NyGsZsn/eLm3btNbhGUd8vqpQy5EmWhqmwvYk9tFAoSFLiLcvCC624b/TCYPL+gk
3ZiI7mXBMjHnUZ0J/RF6kZWTAZDvr/tE7UZt1f8r1eEr8VDzCNp5Pst+HCVIguYD
g9eWF8oH6wYoj39UKf1k+vW2GjXGFsnfivObaxhyz03sAPXK2wQlzAe/4jZ24XNr
TRtOXh97c3CbLAwdUHehlzzdR3U7h0n2KsmrTC5AGmLABmR79s7BJ0+pexuZituO
LwU8+gpf7AugHTrLg1eNXAmBHW44I1ticXYiWcT4iSPn99kNIhlW+Jb1iTGoiu7n
nXyS3b5SCw==
=xwKl
-----END PGP SIGNATURE-----
Merge tag 'block-5.6-2020-03-07' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Here are a few fixes that should go into this release. This contains:
- Revert of a bad bcache patch from this merge window
- Removed unused function (Daniel)
- Fixup for the blktrace fix from Jan from this release (Cengiz)
- Fix of deeper level bfqq overwrite in BFQ (Carlo)"
* tag 'block-5.6-2020-03-07' of git://git.kernel.dk/linux-block:
block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group()
blktrace: fix dereference after null check
Revert "bcache: ignore pending signals when creating gc and allocator thread"
block: Remove used kblockd_schedule_work_on()
bdi.
- Extend dm-bio-record to track additional struct bio members needed
by DM integrity target.
- Fix DM core to properly advertise that a device is suspended during
unload (between the presuspend and postsuspend hooks). This change
is a prereq for related DM integrity and DM writecache fixes. It
elevates DM integrity's 'suspending' state tracking to DM core.
- Four stable fixes for DM integrity target.
- Fix crash in DM cache target due to incorrect work item cancelling.
- Fix DM thin metadata lockdep warning that was introduced during 5.6
merge window.
- Fix DM zoned target's chunk work refcounting that regressed during
recent conversion to refcount_t.
- Bump the minor version for DM core and all target versions that have
seen interface changes or important fixes during the 5.6 cycle.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl5fwPgTHHNuaXR6ZXJA
cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWnUJB/9+36J0Y3HYdUIKM8P5CBH0tdkMqSo4
BFm3Og4Xo1GA5xodNptgD+QoLXy3VWkekUsLvaLgOlPq7haPvHkME20vUMuO1l46
QMNR2VXciYGgV0+CFlpXL2oMr0ZEc1hkt7ZpzYaw1OIk6Mo0tEEDrSo0rvmXafPf
W4veTWRL4HfN8fy3NpQNJ4xBQs4Iw4VC20FPFIOvzA1EGk7VGGaP1mGKmOnjxo/O
o3lEH8NpQgwxwST7d4T6DPdeu3aTifnjdY8/h8mFGpQnxOiZ/wZkheKWwTCNE8r9
oeVY07qVxNAHyC8G/SmAL5KnC6En1hq5jA2M4xh9guUI2k8YbON571n+
=iAS0
-----END PGP SIGNATURE-----
Merge tag 'for-5.6/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- Fix request-based DM's congestion_fn and actually wire it up to the
bdi.
- Extend dm-bio-record to track additional struct bio members needed by
DM integrity target.
- Fix DM core to properly advertise that a device is suspended during
unload (between the presuspend and postsuspend hooks). This change is
a prereq for related DM integrity and DM writecache fixes. It
elevates DM integrity's 'suspending' state tracking to DM core.
- Four stable fixes for DM integrity target.
- Fix crash in DM cache target due to incorrect work item cancelling.
- Fix DM thin metadata lockdep warning that was introduced during 5.6
merge window.
- Fix DM zoned target's chunk work refcounting that regressed during
recent conversion to refcount_t.
- Bump the minor version for DM core and all target versions that have
seen interface changes or important fixes during the 5.6 cycle.
* tag 'for-5.6/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: bump version of core and various targets
dm: fix congested_fn for request-based device
dm integrity: use dm_bio_record and dm_bio_restore
dm bio record: save/restore bi_end_io and bi_integrity
dm zoned: Fix reference counter initial value of chunk works
dm writecache: verify watermark during resume
dm: report suspended device during destroy
dm thin metadata: fix lockdep complaint
dm cache: fix a crash due to incorrect work item cancelling
dm integrity: fix invalid table returned due to argument count mismatch
dm integrity: fix a deadlock due to offloading to an incorrect workqueue
dm integrity: fix recalculation when moving from journal mode to bitmap mode
Changes made during the 5.6 cycle warrant bumping the version number
for DM core and the targets modified by this commit.
It should be noted that dm-thin, dm-crypt and dm-raid already had
their target version bumped during the 5.6 merge window.
Signed-off-by; Mike Snitzer <snitzer@redhat.com>
We neither assign congested_fn for requested-based blk-mq device nor
implement it correctly. So fix both.
Also, remove incorrect comment from dm_init_normal_md_queue and rename
it to dm_init_congested_fn.
Fixes: 4aa9c692e0 ("bdi: separate out congested state into a separate struct")
Cc: stable@vger.kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
In cases where dec_in_flight() has to requeue the integrity_bio_wait
work to transfer the rest of the data, the bio's __bi_remaining might
already have been decremented to 0, e.g.: if bio passed to underlying
data device was split via blk_queue_split().
Use dm_bio_{record,restore} rather than effectively open-coding them in
dm-integrity -- these methods now manage __bi_remaining too.
Depends-on: f7f0b057a9c1 ("dm bio record: save/restore bi_end_io and bi_integrity")
Reported-by: Daniel Glöckner <dg@emlix.com>
Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Also, save/restore __bi_remaining in case the bio was used in a
BIO_CHAIN (e.g. due to blk_queue_split).
Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This reverts commit 0b96da639a.
We can't just go flushing random signals, under the assumption that the
OOM killer will just do something else. It's not safe from the OOM
perspective, and it could also cause other signals to get randomly lost.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dm-zoned initializes reference counters of new chunk works with zero
value and refcount_inc() is called to increment the counter. However, the
refcount_inc() function handles the addition to zero value as an error
and triggers the warning as follows:
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 7 PID: 1506 at lib/refcount.c:25 refcount_warn_saturate+0x68/0xf0
...
CPU: 7 PID: 1506 Comm: systemd-udevd Not tainted 5.4.0+ #134
...
Call Trace:
dmz_map+0x2d2/0x350 [dm_zoned]
__map_bio+0x42/0x1a0
__split_and_process_non_flush+0x14a/0x1b0
__split_and_process_bio+0x83/0x240
? kmem_cache_alloc+0x165/0x220
dm_process_bio+0x90/0x230
? generic_make_request_checks+0x2e7/0x680
dm_make_request+0x3e/0xb0
generic_make_request+0xcf/0x320
? memcg_drain_all_list_lrus+0x1c0/0x1c0
submit_bio+0x3c/0x160
? guard_bio_eod+0x2c/0x130
mpage_readpages+0x182/0x1d0
? bdev_evict_inode+0xf0/0xf0
read_pages+0x6b/0x1b0
__do_page_cache_readahead+0x1ba/0x1d0
force_page_cache_readahead+0x93/0x100
generic_file_read_iter+0x83a/0xe40
? __seccomp_filter+0x7b/0x670
new_sync_read+0x12a/0x1c0
vfs_read+0x9d/0x150
ksys_read+0x5f/0xe0
do_syscall_64+0x5b/0x180
entry_SYSCALL_64_after_hwframe+0x44/0xa9
...
After this warning, following refcount API calls for the counter all fail
to change the counter value.
Fix this by setting the initial reference counter value not zero but one
for the new chunk works. Instead, do not call refcount_inc() via
dmz_get_chunk_work() for the new chunks works.
The failure was observed with linux version 5.4 with CONFIG_REFCOUNT_FULL
enabled. Refcount rework was merged to linux version 5.5 by the
commit 168829ad09 ("Merge branch 'locking-core-for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip"). After this
commit, CONFIG_REFCOUNT_FULL was removed and the failure was observed
regardless of kernel configuration.
Linux version 4.20 merged the commit 092b564876 ("dm zoned: target: use
refcount_t for dm zoned reference counters"). Before this commit, dm
zoned used atomic_t APIs which does not check addition to zero, then this
fix is not necessary.
Fixes: 092b564876 ("dm zoned: target: use refcount_t for dm zoned reference counters")
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Verify the watermark upon resume - so that if the target is reloaded
with lower watermark, it will start the cleanup process immediately.
Fixes: 48debafe4f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The function dm_suspended returns true if the target is suspended.
However, when the target is being suspended during unload, it returns
false.
An example where this is a problem: the test "!dm_suspended(wc->ti)" in
writecache_writeback is not sufficient, because dm_suspended returns
zero while writecache_suspend is in progress. As is, without an
enhanced dm_suspended, simply switching from flush_workqueue to
drain_workqueue still emits warnings:
workqueue writecache-writeback: drain_workqueue() isn't complete after 10 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 100 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 200 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 300 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 400 tries
writecache_suspend calls flush_workqueue(wc->writeback_wq) - this function
flushes the current work. However, the workqueue may re-queue itself and
flush_workqueue doesn't wait for re-queued works to finish. Because of
this - the function writecache_writeback continues execution after the
device was suspended and then concurrently with writecache_dtr, causing
a crash in writecache_writeback.
We must use drain_workqueue - that waits until the work and all re-queued
works finish.
As a prereq for switching to drain_workqueue, this commit fixes
dm_suspended to return true after the presuspend hook and before the
postsuspend hook - just like during a normal suspend. It allows
simplifying the dm-integrity and dm-writecache targets so that they
don't have to maintain suspended flags on their own.
With this change use of drain_workqueue() can be used effectively. This
change was tested with the lvm2 testsuite and cryptsetup testsuite and
the are no regressions.
Fixes: 48debafe4f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18+
Reported-by: Corey Marthaler <cmarthal@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
[ 3934.173244] ======================================================
[ 3934.179572] WARNING: possible circular locking dependency detected
[ 3934.185884] 5.4.21-xfstests #1 Not tainted
[ 3934.190151] ------------------------------------------------------
[ 3934.196673] dmsetup/8897 is trying to acquire lock:
[ 3934.201688] ffffffffbce82b18 (shrinker_rwsem){++++}, at: unregister_shrinker+0x22/0x80
[ 3934.210268]
but task is already holding lock:
[ 3934.216489] ffff92a10cc5e1d0 (&pmd->root_lock){++++}, at: dm_pool_metadata_close+0xba/0x120
[ 3934.225083]
which lock already depends on the new lock.
[ 3934.564165] Chain exists of:
shrinker_rwsem --> &journal->j_checkpoint_mutex --> &pmd->root_lock
For a more detailed lockdep report, please see:
https://lore.kernel.org/r/20200220234519.GA620489@mit.edu
We shouldn't need to hold the lock while are just tearing down and
freeing the whole metadata pool structure.
Fixes: 44d8ebf436 ("dm thin metadata: use pool locking at end of dm_pool_metadata_close")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The crash can be reproduced by running the lvm2 testsuite test
lvconvert-thin-external-cache.sh for several minutes, e.g.:
while :; do make check T=shell/lvconvert-thin-external-cache.sh; done
The crash happens in this call chain:
do_waker -> policy_tick -> smq_tick -> end_hotspot_period -> clear_bitset
-> memset -> __memset -- which accesses an invalid pointer in the vmalloc
area.
The work entry on the workqueue is executed even after the bitmap was
freed. The problem is that cancel_delayed_work doesn't wait for the
running work item to finish, so the work item can continue running and
re-submitting itself even after cache_postsuspend. In order to make sure
that the work item won't be running, we must use cancel_delayed_work_sync.
Also, change flush_workqueue to drain_workqueue, so that if some work item
submits itself or another work item, we are properly waiting for both of
them.
Fixes: c6b4fcbad0 ("dm: add cache target")
Cc: stable@vger.kernel.org # v3.9
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If the flag SB_FLAG_RECALCULATE is present in the superblock, but it was
not specified on the command line (i.e. ic->recalculate_flag is false),
dm-integrity would return invalid table line - the reported number of
arguments would not match the real number.
Fixes: 468dfca38b ("dm integrity: add a bitmap mode")
Cc: stable@vger.kernel.org # v5.2+
Reported-by: Ondrej Kozina <okozina@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If we need to perform synchronous I/O in dm_integrity_map_continue(),
we must make sure that we are not in the map function - in order to
avoid the deadlock due to bio queuing in generic_make_request. To
avoid the deadlock, we offload the request to metadata_wq.
However, metadata_wq also processes metadata updates for write requests.
If there are too many requests that get offloaded to metadata_wq at the
beginning of dm_integrity_map_continue, the workqueue metadata_wq
becomes clogged and the system is incapable of processing any metadata
updates.
This causes a deadlock because all the requests that need to do metadata
updates wait for metadata_wq to proceed and metadata_wq waits inside
wait_and_add_new_range until some existing request releases its range
lock (which doesn't happen because the range lock is released after
metadata update).
In order to fix the deadlock, we create a new workqueue offload_wq and
offload requests to it - so that processing of offload_wq is independent
from processing of metadata_wq.
Fixes: 7eada909bf ("dm: add integrity target")
Cc: stable@vger.kernel.org # v4.12+
Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If we resume a device in bitmap mode and the on-disk format is in journal
mode, we must recalculate anything above ic->sb->recalc_sector. Otherwise,
there would be non-recalculated blocks which would cause I/O errors.
Fixes: 468dfca38b ("dm integrity: add a bitmap mode")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5JdgMQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpg6eEAC7/f/ATcrAbQUjx8vsNS0UgtYB1LPdgga6
7MwYKnrzdWWZICOxOgRR4W//wXFRf3vUtmzF5MgGpzJzgeKuyv09Vz1PLcxkAcny
hu9NTQWu6xtxEGiYlfaSW7EQBRdb1876kzOyV+hTrkvQZ9CXcIAQHwjQPKqjqdlb
/j3l5GSjyO0npvsCqIWrsNoeSwfDOhFi+I3hHZutt3T0fPnjo6DGtXN7m4jhWzW/
U3S4yHxmLVDKzorDDMoTV2D8UGrpjQmRXD78QOpOJKO7ngr9OT69dcMH6TnDkUDb
a7O5l5k5DB+0QOQWk5IHJpMRRp3NdVHgnZhTJZaho8kuKoTLwteJFAprJTzHuuvC
BkTBYf1Ref9KT5YfaUcWI+rmq32LgRq8rRaJBF+vqGcOwJw6WRIuygGyZ8eJMItb
oOhsFEOKKDAILncxg5C61v2Sh4g+/YpQojyVCzJSb7UNjOMtDPgjEp2dDSa+/p84
detTqmlt5ZPYHFvsp/EHuGB6ohscsvKzZk+wlQvj4Wq3T7agSp9uljfiTmUdsmWn
69fs66HBvd3CNoOauSVXvI+rhdsgTMX0ptnjIiPYwE0RQtxptp+rPjOfuT4JdboM
AU8f0VKeer1kRPxVbka9YwgVrUw5JqgFkPu2aC9B+ao+4IAM96n6lJje/rC59zkf
eBclcnOlsQ==
=jy10
-----END PGP SIGNATURE-----
Merge tag 'block-5.6-2020-02-16' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Not a lot here, which is great, basically just three small bcache
fixes from Coly, and four NVMe fixes via Keith"
* tag 'block-5.6-2020-02-16' of git://git.kernel.dk/linux-block:
nvme: fix the parameter order for nvme_get_log in nvme_get_fw_slot_info
nvme/pci: move cqe check after device shutdown
nvme: prevent warning triggered by nvme_stop_keep_alive
nvme/tcp: fix bug on double requeue when send fails
bcache: remove macro nr_to_fifo_front()
bcache: Revert "bcache: shrink btree node cache after bch_btree_check()"
bcache: ignore pending signals when creating gc and allocator thread
Macro nr_to_fifo_front() is only used once in btree_flush_write(),
it is unncessary indeed. This patch removes this macro and does
calculation directly in place.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit 1df3877ff6.
In my testing, sometimes even all the cached btree nodes are freed,
creating gc and allocator kernel threads may still fail. Finally it
turns out that kthread_run() may fail if there is pending signal for
current task. And the pending signal is sent from OOM killer which
is triggered by memory consuption in bch_btree_check().
Therefore explicitly shrinking bcache btree node here does not help,
and after the shrinker callback is improved, as well as pending signals
are ignored before creating kernel threads, now such operation is
unncessary anymore.
This patch reverts the commit 1df3877ff6 ("bcache: shrink btree node
cache after bch_btree_check()") because we have better improvement now.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When run a cache set, all the bcache btree node of this cache set will
be checked by bch_btree_check(). If the bcache btree is very large,
iterating all the btree nodes will occupy too much system memory and
the bcache registering process might be selected and killed by system
OOM killer. kthread_run() will fail if current process has pending
signal, therefore the kthread creating in run_cache_set() for gc and
allocator kernel threads are very probably failed for a very large
bcache btree.
Indeed such OOM is safe and the registering process will exit after
the registration done. Therefore this patch flushes pending signals
during the cache set start up, specificly in bch_cache_allocator_start()
and bch_gc_thread_start(), to make sure run_cache_set() won't fail for
large cahced data set.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull misc vfs updates from Al Viro:
- bmap series from cmaiolino
- getting rid of convolutions in copy_mount_options() (use a couple of
copy_from_user() instead of the __get_user() crap)
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
saner copy_mount_options()
fibmap: Reject negative block numbers
fibmap: Use bmap instead of ->bmap method in ioctl_fibmap
ecryptfs: drop direct calls to ->bmap
cachefiles: drop direct usage of ->bmap method.
fs: Enable bmap() function to properly return errors
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl47ML4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpvm2EACGaxAxP7pLniNV30cRotF8lPpQ5nUrpiem
H1r5WqeI5osCGkRKHaJQ4O0Sw8IV2pWzHTWz+9bv56zLM40yIMaEHLRU00AM047n
KFdA2x4xH+HhbR9lF+flYz1oInlIEXxPiERKm/p1pvQEbquzi4X5cQqv6q2pdzJ9
sf8OBJhKs4rp/ooqzWwjVOeP/n1sT2r+XDg9C9WC5aXaVZbbLw50r1WRYFt1zf7N
oa+91fq2lasxK1c79OtbbGJlBXWTurAtUaKBM0KKPguiH2h9j47pAs0HsV02kZ2M
1ZltwKTyfDNMzBEgvkdB3R0G9nU422nIF+w319i6on8P8xfz8Px13d1KCQGAmfD6
K1YuaCgOjWuVhOKpMwBq9ql6QVP+1LIMKIl2OGJkrBgl9ZzfE8KMZa2QZTGrGO/U
xE/hirYdj5T1O8umUQ4cmZHTROASOJZ8/eU9XHA1vf/eJYXiS31/4ewgRzP3oGX2
5Jvz3o144nBeBTOiFlzs3Fe+wX63QABNG22bijzEGoNTxjXJFroBDYzeiOELjECZ
/xGRZG1bLOGMj8Gg4ZADSILQDkqISsQHofl1I9mWTbwB1j7g69ZjV8Ie2dyMaX6b
5z5Smqzd9gcok9hr8NGWkV3c3NypPxIWxrOcyzYbGLUPDGqa+QjGtlLrGgeinhLM
SitalHw0KA==
=05d8
-----END PGP SIGNATURE-----
Merge tag 'block-5.6-2020-02-05' of git://git.kernel.dk/linux-block
Pull more block updates from Jens Axboe:
"Some later arrivals, but all fixes at this point:
- bcache fix series (Coly)
- Series of BFQ fixes (Paolo)
- NVMe pull request from Keith with a few minor NVMe fixes
- Various little tweaks"
* tag 'block-5.6-2020-02-05' of git://git.kernel.dk/linux-block: (23 commits)
nvmet: update AEN list and array at one place
nvmet: Fix controller use after free
nvmet: Fix error print message at nvmet_install_queue function
brd: check and limit max_part par
nvme-pci: remove nvmeq->tags
nvmet: fix dsm failure when payload does not match sgl descriptor
nvmet: Pass lockdep expression to RCU lists
block, bfq: clarify the goal of bfq_split_bfqq()
block, bfq: get a ref to a group when adding it to a service tree
block, bfq: remove ifdefs from around gets/puts of bfq groups
block, bfq: extend incomplete name of field on_st
block, bfq: get extra ref to prevent a queue from being freed during a group move
block, bfq: do not insert oom queue into position tree
block, bfq: do not plug I/O for bfq_queues with no proc refs
bcache: check return value of prio_read()
bcache: fix incorrect data type usage in btree_flush_write()
bcache: add readahead cache policy options via sysfs interface
bcache: explicity type cast in bset_bkey_last()
bcache: fix memory corruption in bch_cache_accounting_clear()
xen/blkfront: limit allocated memory size to actual use case
...
By now, bmap() will either return the physical block number related to
the requested file offset or 0 in case of error or the requested offset
maps into a hole.
This patch makes the needed changes to enable bmap() to proper return
errors, using the return value as an error return, and now, a pointer
must be passed to bmap() to be filled with the mapped physical block.
It will change the behavior of bmap() on return:
- negative value in case of error
- zero on success or map fell into a hole
In case of a hole, the *block will be zero too
Since this is a prep patch, by now, the only error return is -EINVAL if
->bmap doesn't exist.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now if prio_read() failed during starting a cache set, we can print
out error message in run_cache_set() and handle the failure properly.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dan Carpenter points out that from commit 2aa8c52938 ("bcache: avoid
unnecessary btree nodes flushing in btree_flush_write()"), there is a
incorrect data type usage which leads to the following static checker
warning:
drivers/md/bcache/journal.c:444 btree_flush_write()
warn: 'ref_nr' unsigned <= 0
drivers/md/bcache/journal.c
422 static void btree_flush_write(struct cache_set *c)
423 {
424 struct btree *b, *t, *btree_nodes[BTREE_FLUSH_NR];
425 unsigned int i, nr, ref_nr;
^^^^^^
426 atomic_t *fifo_front_p, *now_fifo_front_p;
427 size_t mask;
428
429 if (c->journal.btree_flushing)
430 return;
431
432 spin_lock(&c->journal.flush_write_lock);
433 if (c->journal.btree_flushing) {
434 spin_unlock(&c->journal.flush_write_lock);
435 return;
436 }
437 c->journal.btree_flushing = true;
438 spin_unlock(&c->journal.flush_write_lock);
439
440 /* get the oldest journal entry and check its refcount */
441 spin_lock(&c->journal.lock);
442 fifo_front_p = &fifo_front(&c->journal.pin);
443 ref_nr = atomic_read(fifo_front_p);
444 if (ref_nr <= 0) {
^^^^^^^^^^^
Unsigned can't be less than zero.
445 /*
446 * do nothing if no btree node references
447 * the oldest journal entry
448 */
449 spin_unlock(&c->journal.lock);
450 goto out;
451 }
452 spin_unlock(&c->journal.lock);
As the warning information indicates, local varaible ref_nr in unsigned
int type is wrong, which does not matche atomic_read() and the "<= 0"
checking.
This patch fixes the above error by defining local variable ref_nr as
int type.
Fixes: 2aa8c52938 ("bcache: avoid unnecessary btree nodes flushing in btree_flush_write()")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In year 2007 high performance SSD was still expensive, in order to
save more space for real workload or meta data, the readahead I/Os
for non-meta data was bypassed and not cached on SSD.
In now days, SSD price drops a lot and people can find larger size
SSD with more comfortable price. It is unncessary to alway bypass
normal readahead I/Os to save SSD space for now.
This patch adds options for readahead data cache policies via sysfs
file /sys/block/bcache<N>/readahead_cache_policy, the options are,
- "all": cache all readahead data I/Os.
- "meta-only": only cache meta data, and bypass other regular I/Os.
If users want to make bcache continue to only cache readahead request
for metadata and bypass regular data readahead, please set "meta-only"
to this sysfs file. By default, bcache will back to cache all read-
ahead requests now.
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Acked-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In bset.h, macro bset_bkey_last() is defined as,
bkey_idx((struct bkey *) (i)->d, (i)->keys)
Parameter i can be variable type of data structure, the macro always
works once the type of struct i has member 'd' and 'keys'.
bset_bkey_last() is also used in macro csum_set() to calculate the
checksum of a on-disk data structure. When csum_set() is used to
calculate checksum of on-disk bcache super block, the parameter 'i'
data type is struct cache_sb_disk. Inside struct cache_sb_disk (also in
struct cache_sb) the member keys is __u16 type. But bkey_idx() expects
unsigned int (a 32bit width), so there is problem when sending
parameters via stack to call bkey_idx().
Sparse tool from Intel 0day kbuild system reports this incompatible
problem. bkey_idx() is part of user space API, so the simplest fix is
to cast the (i)->keys to unsigned int type in macro bset_bkey_last().
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
unlikely case that a DM device is created without a DM table and
then accessed due to upper-layer userspace code or user error.
- Fix DM thin-provisioning's metadata_pre_commit_callback to not use
memory after it is free'd. Also refactor code to disallow changing
the thin-pool's data device once in use -- doing so guarantees smae
lifetime of pool's data device relative to the pool metadata.
- Fix DM space maps used by DM thinp and DM cache to avoid reuse of a
already used block. This race was identified with extremely heavy
snapshot use in the context of DM thin provisioning.
- Fix DM raid's table status relative to an active rebuild.
- Fix DM crypt to use GFP_NOIO rather than GFP_NOFS in call to
skcipher_request_alloc(). Also fix benbi IV constructor crash if
used in authenticated mode.
- Add DM crypt support for Elephant diffuser to allow for Bitlocker
compatibility.
- Fix DM verity target to not prefetch hash blocks for data that has
already been verified.
- Fix DM writecache's incorrect flush sequence during commit when in
SSD mode.
- Improve DM writecache's sequential write performance on SSDs.
- Add DM zoned target support for zone sizes smaller than 128MiB.
- Add DM multipath 'queue_if_no_path_timeout_secs' module param to
allow timeout if path isn't reinstated. This allows users a kernel
safety-net against IO hanging indefinitely, due to no active paths,
that has historically only been provided by multipathd userspace.
- Various DM code cleanups to use true/false rather than 1/0, a
variable rename in dm-dust, and fix for a math error in comment for
DM thin metadata's ondisk format.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl4xvEITHHNuaXR6ZXJA
cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWipYCAC+sX8q/XBDRi4WDCTvCRCcBfz9g9ZO
oygimV64oYf08JiDL54Z29T0EjwGR6DcZB0nEyjhl2/lU4bPwd4kLc/VHwjf44ay
oUJWZxp8Az7pIWjQQ5oC09it8gLmDpBdq2Z146tEDgYrERnH8BgDObYm3ihXwi9f
zMoET8rIzNltMUo6jIumjzxPcLbBsRTnC35mE//PZkMiUUI3ucNWuOhD/ICDe/Tu
VQ9rNtJx01xs07bFKT4OYR2Oc7xrtWEOMDKeSFz16j8t28wywLFz3EPXRG7tKM3l
6CKBFdRqPSL6v2Mr1QD8YwJJCPLsCoOQ14aKn2sDPG8EgQMnh7R2g6OT
=mfnM
-----END PGP SIGNATURE-----
Merge tag 'for-5.6/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- Fix DM core's potential for q->make_request_fn NULL pointer in the
unlikely case that a DM device is created without a DM table and then
accessed due to upper-layer userspace code or user error.
- Fix DM thin-provisioning's metadata_pre_commit_callback to not use
memory after it is free'd. Also refactor code to disallow changing
the thin-pool's data device once in use -- doing so guarantees smae
lifetime of pool's data device relative to the pool metadata.
- Fix DM space maps used by DM thinp and DM cache to avoid reuse of a
already used block. This race was identified with extremely heavy
snapshot use in the context of DM thin provisioning.
- Fix DM raid's table status relative to an active rebuild.
- Fix DM crypt to use GFP_NOIO rather than GFP_NOFS in call to
skcipher_request_alloc(). Also fix benbi IV constructor crash if used
in authenticated mode.
- Add DM crypt support for Elephant diffuser to allow for Bitlocker
compatibility.
- Fix DM verity target to not prefetch hash blocks for data that has
already been verified.
- Fix DM writecache's incorrect flush sequence during commit when in
SSD mode.
- Improve DM writecache's sequential write performance on SSDs.
- Add DM zoned target support for zone sizes smaller than 128MiB.
- Add DM multipath 'queue_if_no_path_timeout_secs' module param to
allow timeout if path isn't reinstated. This allows users a kernel
safety-net against IO hanging indefinitely, due to no active paths,
that has historically only been provided by multipathd userspace.
- Various DM code cleanups to use true/false rather than 1/0, a
variable rename in dm-dust, and fix for a math error in comment for
DM thin metadata's ondisk format.
* tag 'for-5.6/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (21 commits)
dm: fix potential for q->make_request_fn NULL pointer
dm writecache: improve performance of large linear writes on SSDs
dm mpath: Add timeout mechanism for queue_if_no_path
dm thin: change data device's flush_bio to be member of struct pool
dm thin: don't allow changing data device during thin-pool reload
dm thin: fix use-after-free in metadata_pre_commit_callback
dm thin metadata: use pool locking at end of dm_pool_metadata_close
dm writecache: fix incorrect flush sequence when doing SSD mode commit
dm crypt: fix benbi IV constructor crash if used in authenticated mode
dm crypt: Implement Elephant diffuser for Bitlocker compatibility
dm space map common: fix to ensure new block isn't already in use
dm verity: don't prefetch hash blocks for already-verified data
dm crypt: fix GFP flags passed to skcipher_request_alloc()
dm thin metadata: Fix trivial math error in on-disk format documentation
dm thin metadata: use true/false for bool variable
dm snapshot: use true/false for bool variable
dm bio prison v2: use true/false for bool variable
dm mpath: use true/false for bool variable
dm zoned: support zone sizes smaller than 128MiB
dm raid: table line rebuild status fixes
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl4vOrAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgph+HD/9bM9CqjchDitL0NQne+4BwBdoRCcik0z7n
y/CrNIh3tZnkJO0fT9Lz6GD/6iZNU93NUHFMOgzuS+8mR5CwQUkR/xjDvPX8H07F
h+Xl8ZUX6YjbuLmO0sgc9yu3SkMaxjCHfGPl1juZXwH6ERM6MTSkg6O+YwQZnvAB
lLJWaa1oOTQAsbnz7ZVwZ5pDOfkoSirCat2kzPoyfzptcIrUw7vfu4QHdCdNHy63
eT/vcHmj6CqzZWJRfpkaFOY6fnY30Hh9fqAVQvzxHPvm1vM3z7JSw7cY8t+cjXjn
TJ0NQK2QFmGTTa/ZEf3KCB5kbNV0SpOV6Jqz1aBX/cStQez6ygFkPGscPbwy8tsR
vBVDyCMZC42jbt7TuIHNkAI/e+HqSOBgyB8MaWaQfApcbNzTIFp9lltrTcZpaYNZ
J4R6YQGDve+ElUlOAPBbiXRGrd3jmhApP8scbgls05UwZOtDf+KJBCLQYRzw8qrb
J7D7hVugwV0oDhdaUkd4Pt3KYoISsFgIe7HRuKGGmfKyqWiJ5iLH0QVPaEkPAokr
VzzSoex+5xcCSvIiGd1DNzsVD9C2xbyUvifHTa36pYKQ65BogyJBopgYgEYd8ksN
AlmPxJM9j1o85TtV1CAbb2O0827BlLmYLc6BcdD+s0x+FeStdnjwICQooHiitTiI
hEHajSDujQ==
=Us3h
-----END PGP SIGNATURE-----
Merge tag 'for-5.6/drivers-2020-01-27' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
"Like the core side, not a lot of changes here, just two main items:
- Series of patches (via Coly) with fixes for bcache (Coly,
Christoph)
- MD pull request from Song"
* tag 'for-5.6/drivers-2020-01-27' of git://git.kernel.dk/linux-block: (31 commits)
bcache: reap from tail of c->btree_cache in bch_mca_scan()
bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()
bcache: remove member accessed from struct btree
bcache: print written and keys in trace_bcache_btree_write
bcache: avoid unnecessary btree nodes flushing in btree_flush_write()
bcache: add code comments for state->pool in __btree_sort()
lib: crc64: include <linux/crc64.h> for 'crc64_be'
bcache: use read_cache_page_gfp to read the superblock
bcache: store a pointer to the on-disk sb in the cache and cached_dev structures
bcache: return a pointer to the on-disk sb from read_super
bcache: transfer the sb_page reference to register_{bdev,cache}
bcache: fix use-after-free in register_bcache()
bcache: properly initialize 'path' and 'err' in register_bcache()
bcache: rework error unwinding in register_bcache
bcache: use a separate data structure for the on-disk super block
bcache: cached_dev_free needs to put the sb page
md/raid1: introduce wait_for_serialization
md/raid1: use bucket based mechanism for IO serialization
md: introduce a new struct for IO serialization
md: don't destroy serial_info_pool if serialize_policy is true
...
Move blk_queue_make_request() to dm.c:alloc_dev() so that
q->make_request_fn is never NULL during the lifetime of a DM device
(even one that is created without a DM table).
Otherwise generic_make_request() will crash simply by doing:
dmsetup create -n test
mount /dev/dm-N /mnt
While at it, move ->congested_data initialization out of
dm.c:alloc_dev() and into the bio-based specific init method.
Reported-by: Stefan Bader <stefan.bader@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1860231
Fixes: ff36ab3458 ("dm: remove request-based logic from make_request_fn wrapper")
Depends-on: c12c9a3c38 ("dm: various cleanups to md->queue initialization code")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When shrink btree node cache from c->btree_cache in bch_mca_scan(),
no matter the selected node is reaped or not, it will be rotated from
the head to the tail of c->btree_cache list. But in bcache journal
code, when flushing the btree nodes with oldest journal entry, btree
nodes are iterated and slected from the tail of c->btree_cache list in
btree_flush_write(). The list_rotate_left() in bch_mca_scan() will
make btree_flush_write() iterate more nodes in c->btree_list in reverse
order.
This patch just reaps the selected btree node cache, and not move it
from the head to the tail of c->btree_cache list. Then bch_mca_scan()
will not mess up c->btree_cache list to btree_flush_write().
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In order to skip the most recently freed btree node cahce, currently
in bch_mca_scan() the first 3 caches in c->btree_cache_freeable list
are skipped when shrinking bcache node caches in bch_mca_scan(). The
related code in bch_mca_scan() is,
737 list_for_each_entry_safe(b, t, &c->btree_cache_freeable, list) {
738 if (nr <= 0)
739 goto out;
740
741 if (++i > 3 &&
742 !mca_reap(b, 0, false)) {
lines free cache memory
746 }
747 nr--;
748 }
The problem is, if virtual memory code calls bch_mca_scan() and
the calculated 'nr' is 1 or 2, then in the above loop, nothing will
be shunk. In such case, if slub/slab manager calls bch_mca_scan()
for many times with small scan number, it does not help to shrink
cache memory and just wasts CPU cycles.
This patch just selects btree node caches from tail of the
c->btree_cache_freeable list, then the newly freed host cache can
still be allocated by mca_alloc(), and at least 1 node can be shunk.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The member 'accessed' of struct btree is used in bch_mca_scan() when
shrinking btree node caches. The original idea is, if b->accessed is
set, clean it and look at next btree node cache from c->btree_cache
list, and only shrink the caches whose b->accessed is cleaned. Then
only cold btree node cache will be shrunk.
But when I/O pressure is high, it is very probably that b->accessed
of a btree node cache will be set again in bch_btree_node_get()
before bch_mca_scan() selects it again. Then there is no chance for
bch_mca_scan() to shrink enough memory back to slub or slab system.
This patch removes member accessed from struct btree, then once a
btree node ache is selected, it will be immediately shunk. By this
change, bch_mca_scan() may release btree node cahce more efficiently.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
the commit 91be66e131 ("bcache: performance improvement for
btree_flush_write()") was an effort to flushing btree node with oldest
btree node faster in following methods,
- Only iterate dirty btree nodes in c->btree_cache, avoid scanning a lot
of clean btree nodes.
- Take c->btree_cache as a LRU-like list, aggressively flushing all
dirty nodes from tail of c->btree_cache util the btree node with
oldest journal entry is flushed. This is to reduce the time of holding
c->bucket_lock.
Guoju Fang and Shuang Li reported that they observe unexptected extra
write I/Os on cache device after applying the above patch. Guoju Fang
provideed more detailed diagnose information that the aggressive
btree nodes flushing may cause 10x more btree nodes to flush in his
workload. He points out when system memory is large enough to hold all
btree nodes in memory, c->btree_cache is not a LRU-like list any more.
Then the btree node with oldest journal entry is very probably not-
close to the tail of c->btree_cache list. In such situation much more
dirty btree nodes will be aggressively flushed before the target node
is flushed. When slow SATA SSD is used as cache device, such over-
aggressive flushing behavior will cause performance regression.
After spending a lot of time on debug and diagnose, I find the real
condition is more complicated, aggressive flushing dirty btree nodes
from tail of c->btree_cache list is not a good solution.
- When all btree nodes are cached in memory, c->btree_cache is not
a LRU-like list, the btree nodes with oldest journal entry won't
be close to the tail of the list.
- There can be hundreds dirty btree nodes reference the oldest journal
entry, before flushing all the nodes the oldest journal entry cannot
be reclaimed.
When the above two conditions mixed together, a simply flushing from
tail of c->btree_cache list is really NOT a good idea.
Fortunately there is still chance to make btree_flush_write() work
better. Here is how this patch avoids unnecessary btree nodes flushing,
- Only acquire c->journal.lock when getting oldest journal entry of
fifo c->journal.pin. In rested locations check the journal entries
locklessly, so their values can be changed on other cores
in parallel.
- In loop list_for_each_entry_safe_reverse(), checking latest front
point of fifo c->journal.pin. If it is different from the original
point which we get with locking c->journal.lock, it means the oldest
journal entry is reclaim on other cores. At this moment, all selected
dirty nodes recorded in array btree_nodes[] are all flushed and clean
on other CPU cores, it is unncessary to iterate c->btree_cache any
longer. Just quit the list_for_each_entry_safe_reverse() loop and
the following for-loop will skip all the selected clean nodes.
- Find a proper time to quit the list_for_each_entry_safe_reverse()
loop. Check the refcount value of orignial fifo front point, if the
value is larger than selected node number of btree_nodes[], it means
more matching btree nodes should be scanned. Otherwise it means no
more matching btee nodes in rest of c->btree_cache list, the loop
can be quit. If the original oldest journal entry is reclaimed and
fifo front point is updated, the refcount of original fifo front point
will be 0, then the loop will be quit too.
- Not hold c->bucket_lock too long time. c->bucket_lock is also required
for space allocation for cached data, hold it for too long time will
block regular I/O requests. When iterating list c->btree_cache, even
there are a lot of maching btree nodes, in order to not holding
c->bucket_lock for too long time, only BTREE_FLUSH_NR nodes are
selected and to flush in following for-loop.
With this patch, only btree nodes referencing oldest journal entry
are flushed to cache device, no aggressive flushing for unnecessary
btree node any more. And in order to avoid blocking regluar I/O
requests, each time when btree_flush_write() called, at most only
BTREE_FLUSH_NR btree nodes are selected to flush, even there are more
maching btree nodes in list c->btree_cache.
At last, one more thing to explain: Why it is safe to read front point
of c->journal.pin without holding c->journal.lock inside the
list_for_each_entry_safe_reverse() loop ?
Here is my answer: When reading the front point of fifo c->journal.pin,
we don't need to know the exact value of front point, we just want to
check whether the value is different from the original front point
(which is accurate value because we get it while c->jouranl.lock is
held). For such purpose, it works as expected without holding
c->journal.lock. Even the front point is changed on other CPU core and
not updated to local core, and current iterating btree node has
identical journal entry local as original fetched fifo front point, it
is still safe. Because after holding mutex b->write_lock (with memory
barrier) this btree node can be found as clean and skipped, the loop
will quite latter when iterate on next node of list c->btree_cache.
Fixes: 91be66e131 ("bcache: performance improvement for btree_flush_write()")
Reported-by: Guoju Fang <fangguoju@gmail.com>
Reported-by: Shuang Li <psymon@bonuscloud.io>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To explain the pages allocated from mempool state->pool can be
swapped in __btree_sort(), because state->pool is a page pool,
which allocates pages by alloc_pages() indeed.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Avoid a pointless dependency on buffer heads in bcache by simply open
coding reading a single page. Also add a SB_OFFSET define for the
byte offset of the superblock instead of using magic numbers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This allows to properly build the superblock bio including the offset in
the page using the normal bio helpers. This fixes writing the superblock
for page sizes larger than 4k where the sb write bio would need an offset
in the bio_vec.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Returning the properly typed actual data structure insteaf of the
containing struct page will save the callers some work going
forward.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Avoid an extra reference count roundtrip by transferring the sb_page
ownership to the lower level register helpers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The patch "bcache: rework error unwinding in register_bcache" introduces
a use-after-free regression in register_bcache(). Here are current code,
2510 out_free_path:
2511 kfree(path);
2512 out_module_put:
2513 module_put(THIS_MODULE);
2514 out:
2515 pr_info("error %s: %s", path, err);
2516 return ret;
If some error happens and the above code path is executed, at line 2511
path is released, but referenced at line 2515. Then KASAN reports a use-
after-free error message.
This patch changes line 2515 in the following way to fix the problem,
2515 pr_info("error %s: %s", path?path:"", err);
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Patch "bcache: rework error unwinding in register_bcache" from
Christoph Hellwig changes the local variables 'path' and 'err'
in undefined initial state. If the code in register_bcache() jumps
to label 'out:' or 'out_module_put:' by goto, these two variables
might be reference with undefined value by the following line,
out_module_put:
module_put(THIS_MODULE);
out:
pr_info("error %s: %s", path, err);
return ret;
Therefore this patch initializes these two local variables properly
in register_bcache() to avoid such issue.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Split the successful and error return path, and use one goto label for each
resource to unwind. This also fixes some small errors like leaking the
module reference count in the reboot case (which seems entirely harmless)
or printing the wrong warning messages for early failures.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Split out an on-disk version struct cache_sb with the proper endianness
annotations. This fixes a fair chunk of sparse warnings, but there are
some left due to the way the checksum is defined.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Same as cache device, the buffer page needs to be put while
freeing cached_dev. Otherwise a page would be leaked every
time a cached_dev is stopped.
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When dm-writecache is used with SSD as a cache device, it would submit a
separate bio for each written block. The I/Os would be merged by the disk
scheduler, but this merging degrades performance.
Improve dm-writecache performance by submitting larger bios - this is
possible as long as there is consecutive free space on the cache
device.
Benchmark (arm64 with 64k page size, using /dev/ram0 as a cache device):
fio --bs=512k --iodepth=32 --size=400M --direct=1 \
--filename=/dev/mapper/cache --rw=randwrite --numjobs=1 --name=test
block old new
size MiB/s MiB/s
---------------------
512 181 700
1k 347 1256
2k 644 2020
4k 1183 2759
8k 1852 3333
16k 2469 3509
32k 2974 3670
64k 3404 3810
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Logical block size has type unsigned short. That means that it can be at
most 32768. However, there are architectures that can run with 64k pages
(for example arm64) and on these architectures, it may be possible to
create block devices with 64k block size.
For exmaple (run this on an architecture with 64k pages):
Mount will fail with this error because it tries to read the superblock using 2-sector
access:
device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536
EXT4-fs (dm-0): unable to read superblock
This patch changes the logical block size from unsigned short to unsigned
int to avoid the overflow.
Cc: stable@vger.kernel.org
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a configurable timeout mechanism to disable queue_if_no_path without
assistance from userspace multipathd. This reimplements multipathd's
no_path_retry mechanism in kernel space. This is motivated by the
desire to prevent processes from hanging indefinitely waiting for IO
in cases where multipathd might be unable to respond (after a failure
or for whatever reason).
Despite replicating userspace multipathd's policy configuration in
kernel space, it is important to prevent IOs from hanging forever,
waiting for userspace that may be incapable of behaving correctly.
Use of the provided "queue_if_no_path_timeout_secs" dm-multipath
module parameter is optional. This timeout mechanism is disabled by
default (by being set to 0).
Signed-off-by: Anatol Pomazau <anatol@google.com>
Co-developed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
With commit fe64369163c5 ("dm thin: don't allow changing data device
during thin-pool load") it is now possible to re-parent the data
device's flush_bio from the pool_c to pool structure. Doing so offers
improved lifetime guarantees for the flush_bio so that the call to
dm_pool_register_pre_commit_callback can now be done safely from
pool_ctr().
Depends-on: fe64369163c5 ("dm thin: don't allow changing data device during thin-pool load")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The existing code allows changing the data device when the thin-pool
target is reloaded.
This capability is not required and only complicates device lifetime
guarantees. This can cause crashes like the one reported here:
https://bugzilla.redhat.com/show_bug.cgi?id=1788596
where the kernel tries to issue a flush bio located in a structure that
was already freed.
Take the first step to simplifying the thin-pool's data device lifetime
by disallowing changing it. Like the thin-pool's metadata device, the
data device is now set in pool_create() and it cannot be changed for a
given thin-pool.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
dm-thin uses struct pool to hold the state of the pool. There may be
multiple pool_c's pointing to a given pool, each pool_c represents a
loaded target. pool_c's may be created and destroyed arbitrarily and the
pool contains a reference count of pool_c's pointing to it.
Since commit 694cfe7f31 ("dm thin: Flush data device before
committing metadata") a pointer to pool_c is passed to
dm_pool_register_pre_commit_callback and this function stores it in
pmd->pre_commit_context. If this pool_c is freed, but pool is not
(because there is another pool_c referencing it), we end up in a
situation where pmd->pre_commit_context structure points to freed
pool_c. It causes a crash in metadata_pre_commit_callback.
Fix this by moving the dm_pool_register_pre_commit_callback() from
pool_ctr() to pool_preresume(). This way the in-core thin-pool metadata
is only ever armed with callback data whose lifetime matches the
active thin-pool target.
In should be noted that this fix preserves the ability to load a
thin-pool table that uses a different data block device (that contains
the same data) -- though it is unclear if that capability is still
useful and/or needed.
Fixes: 694cfe7f31 ("dm thin: Flush data device before committing metadata")
Cc: stable@vger.kernel.org
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Ensure that the pool is locked during calls to __commit_transaction and
__destroy_persistent_data_objects. Just being consistent with locking,
but reality is dm_pool_metadata_close is called once pool is being
destroyed so access to pool shouldn't be contended.
Also, use pmd_write_lock_in_core rather than __pmd_write_lock in
dm_pool_commit_metadata and rename __pmd_write_lock to
pmd_write_lock_in_core -- there was no need for the alias.
In addition, verify that the pool is locked in __commit_transaction().
Fixes: 873f258bec ("dm thin metadata: do not write metadata if no changes occurred")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When committing state, the function writecache_flush does the following:
1. write metadata (writecache_commit_flushed)
2. flush disk cache (writecache_commit_flushed)
3. wait for data writes to complete (writecache_wait_for_ios)
4. increase superblock seq_count
5. write the superblock
6. flush disk cache
It may happen that at step 3, when we wait for some write to finish, the
disk may report the write as finished, but the write only hit the disk
cache and it is not yet stored in persistent storage. At step 5 we write
the superblock - it may happen that the superblock is written before the
write that we waited for in step 3. If the machine crashes, it may result
in incorrect data being returned after reboot.
In order to fix the bug, we must swap steps 2 and 3 in the above sequence,
so that we first wait for writes to complete and then flush the disk
cache.
Fixes: 48debafe4f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If benbi IV is used in AEAD construction, for example:
cryptsetup luksFormat <device> --cipher twofish-xts-benbi --key-size 512 --integrity=hmac-sha256
the constructor uses wrong skcipher function and crashes:
BUG: kernel NULL pointer dereference, address: 00000014
...
EIP: crypt_iv_benbi_ctr+0x15/0x70 [dm_crypt]
Call Trace:
? crypt_subkey_size+0x20/0x20 [dm_crypt]
crypt_ctr+0x567/0xfc0 [dm_crypt]
dm_table_add_target+0x15f/0x340 [dm_mod]
Fix this by properly using crypt_aead_blocksize() in this case.
Fixes: ef43aa3806 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)")
Cc: stable@vger.kernel.org # v4.12+
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=941051
Reported-by: Jerad Simpson <jbsimpson@gmail.com>
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add experimental support for BitLocker encryption with CBC mode and
additional Elephant diffuser.
The mode was used in older Windows systems and it is provided mainly
for compatibility reasons. The userspace support to activate these
devices is being added to cryptsetup utility.
Read-write activation of such a device is very simple, for example:
echo <password> | cryptsetup bitlkOpen bitlk_image.img test
The Elephant diffuser uses two rotations in opposite direction for
data (Diffuser A and B) and also XOR operation with Sector key over
the sector data; Sector key is derived from additional key data. The
original public documentation is available here:
http://download.microsoft.com/download/0/2/3/0238acaf-d3bf-4a6d-b3d6-0a0be4bbb36e/bitlockercipher200608.pdf
The dm-crypt implementation is embedded to "elephant" IV (similar to
tcw IV construction).
Because we cannot modify original bio data for write (before
encryption), an additional internal flag to pre-process data is
added.
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The space-maps track the reference counts for disk blocks allocated by
both the thin-provisioning and cache targets. There are variants for
tracking metadata blocks and data blocks.
Transactionality is implemented by never touching blocks from the
previous transaction, so we can rollback in the event of a crash.
When allocating a new block we need to ensure the block is free (has
reference count of 0) in both the current and previous transaction.
Prior to this fix we were doing this by searching for a free block in
the previous transaction, and relying on a 'begin' counter to track
where the last allocation in the current transaction was. This
'begin' field was not being updated in all code paths (eg, increment
of a data block reference count due to breaking sharing of a neighbour
block in the same btree leaf).
This fix keeps the 'begin' field, but now it's just a hint to speed up
the search. Instead the current transaction is searched for a free
block, and then the old transaction is double checked to ensure it's
free. Much simpler.
This fixes reports of sm_disk_new_block()'s BUG_ON() triggering when
DM thin-provisioning's snapshots are heavily used.
Reported-by: Eric Wheeler <dm-devel@lists.ewheeler.net>
Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Previously, we call check_and_add_serial when serialization is
enabled for write IO, but it could allocate and free memory
back and forth.
Now, let's just get an element from memory pool with the new
function, then insert node to rb tree if no collision happens.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Since raid1 had already used bucket based mechanism to reduce
the conflict between write IO and resync IO, it is possible to
speed up performance for io serialization with refer to the
same mechanism.
To align with the barrier bucket mechanism, we created arrays
(with the same number of BARRIER_BUCKETS_NR) for spinlock, rb
tree and waitqueue. Then we can reduce lock competition with
multiple spinlocks, boost search performance with multiple rb
trees and also reduce thundering herd problem with multiple
waitqueues.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Obviously, IO serialization could cause the degradation of
performance a lot. In order to reduce the degradation, so a
rb interval tree is added in raid1 to speed up the check of
collision.
So, a rb root is needed in md_rdev, then abstract all the
serialize related members to a new struct (serial_in_rdev),
embed it into md_rdev.
Of course, we need to free the struct if it is not needed
anymore, so rdev/rdevs_uninit_serial are added accordingly.
And they should be called when destroty memory pool or can't
alloc memory.
And we need to consider to call mddev_destroy_serial_pool
in case serialize_policy/write-behind is disabled, bitmap
is destroyed or in __md_stop_writes.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
The serial_info_pool is needed if array sets serialize_policy to
true, so don't destroy it.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Before dispatch write bio, raid1 array which enables
serialize_policy need to check if overlap exists between
this bio and previous on-flying bios. If there is overlap,
then it has to wait until the collision is disappeared.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
So far, IO serialization is used for two scenarios:
1. raid1 which enables write-behind mode, and there is rdev in the array
which is multi-queue device and flaged with writemostly.
2. IO serialization is enabled or disabled by change serialize_policy.
So introduce rdev_need_serial to check the first scenario. And for 1, IO
serialization is enabled automatically while 2 is controlled manually.
And it is possible that both scenarios are true, so for create serial pool,
rdev/rdevs_init_serial should be separate from check if the pool existed or
not. Then for destroy pool, we need to check if the pool is needed by other
rdevs due to the first scenario.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
With the new sysfs node, we can use it to control if raid1 array
wants io serialization or not. So mddev_create_serial_pool and
mddev_destroy_serial_pool are called in serialize_policy_store
to enable or disable the serialization.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
1. The related resources (spin_lock, list and waitqueue) are needed for
address raid1 reorder overlap issue too, in this case, rdev is set to
NULL for mddev_create/destroy_serial_pool which implies all rdevs need
to handle these resources.
And also add "is_suspend" to mddev_destroy_serial_pool since it will
be called under suspended situation, which also makes both create and
destroy pool have same arguments.
2. Introduce rdevs_init_serial which is called if raid1 io serialization
is enabled since all rdevs need to init related stuffs.
3. rdev_init_serial and clear_bit(CollisionCheck, &rdev->flags) should
be called between suspend and resume.
No need to export mddev_create_serial_pool since it is only called in
md-mod module.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
It actually means create here, so fix the typo.
Reported-by: Song Liu <liu.song.a23@gmail.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Previously, wb_info_pool and wb_list stuffs are introduced
to address potential data inconsistence issue for write
behind device.
Now rename them to serial related name, since the same
mechanism will be used to address reorder overlap write
issue for raid1.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
We can use "cnt" directly to update conf->worker_cnt_per_group
if alloc_thread_groups returns 0.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
In md_bitmap_unplug, bitmap->storage.filemap is double checked.
In md_bitmap_daemon_work, bitmap->storage.filemap should be checked
before reference.
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Try to skip prefetching hash blocks that won't be needed due to the
"check_at_most_once" option being enabled and the corresponding data
blocks already having been verified.
Since prefetching operates on a range of data blocks, do this by just
trimming the two ends of the range. This doesn't skip every unneeded
hash block, since data blocks in the middle of the range could also be
unneeded, and hash blocks are still prefetched in large clusters as
controlled by dm_verity_prefetch_cluster. But it can still help a lot.
In a test on Android Q launching 91 apps every 15s repeated 21 times,
prefetching was only done for 447177/4776629 = 9.36% of data blocks.
Tested-by: ruxian.feng <ruxian.feng@transsion.com>
Co-developed-by: yuanjiong.gao <yuanjiong.gao@transsion.com>
Signed-off-by: yuanjiong.gao <yuanjiong.gao@transsion.com>
Signed-off-by: xianrong.zhou <xianrong.zhou@transsion.com>
[EB: simplified the 'while' loops and improved the commit message]
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
GFP_KERNEL is not supposed to be or'd with GFP_NOFS (the result is
equivalent to GFP_KERNEL). Also, we use GFP_NOIO instead of GFP_NOFS
because we don't want any I/O being submitted in the direct reclaim
path.
Fixes: 39d13a1ac4 ("dm crypt: reuse eboiv skcipher for IV generation")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fixes coccicheck warning:
drivers/md/dm-thin-metadata.c:814:3-14: WARNING: Assignment of 0/1 to bool variable
drivers/md/dm-thin-metadata.c:1109:1-12: WARNING: Assignment of 0/1 to bool variable
drivers/md/dm-thin-metadata.c:1621:1-12: WARNING: Assignment of 0/1 to bool variable
drivers/md/dm-thin-metadata.c:1652:1-12: WARNING: Assignment of 0/1 to bool variable
drivers/md/dm-thin-metadata.c:1706:1-12: WARNING: Assignment of 0/1 to bool variable
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fixes coccicheck warning:
drivers/md/dm-snap.c:1064:3-18: WARNING: Assignment of 0/1 to bool variable
drivers/md/dm-snap.c:1152:1-16: WARNING: Assignment of 0/1 to bool variable
drivers/md/dm-snap.c:1317:1-16: WARNING: Assignment of 0/1 to bool variable
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
dm-zoned is observed to log failed kernel assertions and not work
correctly when operating against a device with a zone size smaller
than 128MiB (e.g. 32768 bits per 4K block). The reason is that the
bitmap size per zone is calculated as zero with such a small zone
size. Fix this problem and also make the code related to zone bitmap
management be able to handle per zone bitmaps smaller than a single
block.
A dm-zoned-tools patch is required to properly format dm-zoned devices
with zone sizes smaller than 128MiB.
Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
raid_status() wasn't emitting rebuild flags on the table line properly
because the rdev number was not yet set properly; index raid component
devices array directly to solve.
Also fix wrong argument count on emitted table line caused by 1 too
many rebuild/write_mostly argument and consider any journal_(dev|mode)
pairs.
Link: https://bugzilla.redhat.com/1782045
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>