Commit Graph

602 Commits

Author SHA1 Message Date
Gustavo A. R. Silva df561f6688 treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23 17:36:59 -05:00
Linus Torvalds 2f12d44085 - DM multipath locking fixes around m->flags tests and improvements to
bio-based code so that it follows patterns established by
   request-based code.
 
 - Request-based DM core improvement to eliminate unnecessary call to
   blk_mq_queue_stopped().
 
 - Add "panic_on_corruption" error handling mode to DM verity target.
 
 - DM bufio fix to to perform buffer cleanup from a workqueue rather
   than wait for IO in reclaim context from shrinker.
 
 - DM crypt improvement to optionally avoid async processing via
   workqueues for reads and/or writes -- via "no_read_workqueue" and
   "no_write_workqueue" features.  This more direct IO processing
   improves latency and throughput with faster storage.  Avoiding
   workqueue IO submission for writes (DM_CRYPT_NO_WRITE_WORKQUEUE) is
   a requirement for adding zoned block device support to DM crypt.
 
 - Add zoned block device support to DM crypt.  Makes use of
   DM_CRYPT_NO_WRITE_WORKQUEUE and a new optional feature
   (DM_CRYPT_WRITE_INLINE) that allows write completion to wait for
   encryption to complete.  This allows write ordering to be preserved,
   which is needed for zoned block devices.
 
 - Fix DM ebs target's check for REQ_OP_FLUSH.
 
 - Fix DM core's report zones support to not report more zones than
   were requested.
 
 - A few small compiler warning fixes.
 
 - DM dust improvements to return output directly to the user rather
   than require they scrape the system log for output.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl8tdOQTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWvDlB/sF8svagDqeqs27xTxCiUPykD29cMmS
 OGPr0Mp/BntZOBpSaTPM9s5XucP3WJhPsxet5qeoyM3OViSFx+O55PqPjn8C65y0
 eGMa4zknd9eO1933+ijmyQu6VNr4sf/6nusX4xSGqv00UR22dJ+3pHtfN9ANDXYX
 AAYA0Ve6UuOwAbGUCnRGI/2780aYY0B8Ok+cF21CskqryF+RpmbZ6BsR07+Hk4cy
 LX5EaHUqezW12cibLq2f0l7TLLJ86OscvqyU9lGVIxiV57e2i5c2S1HvhKZu+Wn3
 6CUmlOhGI0viCKgM1ArekZ+zOw9ROIaAKKPzC5mspqx9yuuCqdY8k8xV
 =X3tt
 -----END PGP SIGNATURE-----

Merge tag 'for-5.9/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - DM multipath locking fixes around m->flags tests and improvements to
   bio-based code so that it follows patterns established by
   request-based code.

 - Request-based DM core improvement to eliminate unnecessary call to
   blk_mq_queue_stopped().

 - Add "panic_on_corruption" error handling mode to DM verity target.

 - DM bufio fix to to perform buffer cleanup from a workqueue rather
   than wait for IO in reclaim context from shrinker.

 - DM crypt improvement to optionally avoid async processing via
   workqueues for reads and/or writes -- via "no_read_workqueue" and
   "no_write_workqueue" features. This more direct IO processing
   improves latency and throughput with faster storage. Avoiding
   workqueue IO submission for writes (DM_CRYPT_NO_WRITE_WORKQUEUE) is a
   requirement for adding zoned block device support to DM crypt.

 - Add zoned block device support to DM crypt. Makes use of
   DM_CRYPT_NO_WRITE_WORKQUEUE and a new optional feature
   (DM_CRYPT_WRITE_INLINE) that allows write completion to wait for
   encryption to complete. This allows write ordering to be preserved,
   which is needed for zoned block devices.

 - Fix DM ebs target's check for REQ_OP_FLUSH.

 - Fix DM core's report zones support to not report more zones than were
   requested.

 - A few small compiler warning fixes.

 - DM dust improvements to return output directly to the user rather
   than require they scrape the system log for output.

* tag 'for-5.9/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: don't call report zones for more than the user requested
  dm ebs: Fix incorrect checking for REQ_OP_FLUSH
  dm init: Set file local variable static
  dm ioctl: Fix compilation warning
  dm raid: Remove empty if statement
  dm verity: Fix compilation warning
  dm crypt: Enable zoned block device support
  dm crypt: add flags to optionally bypass kcryptd workqueues
  dm bufio: do buffer cleanup from a workqueue
  dm rq: don't call blk_mq_queue_stopped() in dm_stop_queue()
  dm dust: add interface to list all badblocks
  dm dust: report some message results directly back to user
  dm verity: add "panic_on_corruption" error handling mode
  dm mpath: use double checked locking in fast path
  dm mpath: rename current_pgpath to pgpath in multipath_prepare_ioctl
  dm mpath: rework __map_bio()
  dm mpath: factor out multipath_queue_bio
  dm mpath: push locking down to must_push_back_rq()
  dm mpath: take m->lock spinlock when testing QUEUE_IF_NO_PATH
  dm mpath: changes from initial m->flags locking audit
2020-08-07 13:08:09 -07:00
Johannes Thumshirn a9cb9f4148 dm: don't call report zones for more than the user requested
Don't call report zones for more zones than the user actually requested,
otherwise this can lead to out-of-bounds accesses in the callback
functions.

Such a situation can happen if the target's ->report_zones() callback
function returns 0 because we've reached the end of the target and then
restart the report zones on the second target.

We're again calling into ->report_zones() and ultimately into the user
supplied callback function but when we're not subtracting the number of
zones already processed this may lead to out-of-bounds accesses in the
user callbacks.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Fixes: d41003513e ("block: rework zone reporting")
Cc: stable@vger.kernel.org # v5.5+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-08-04 16:31:12 -04:00
Linus Torvalds 382625d0d4 for-5.9/block-20200802
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8m7YwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpt+dEAC7a0HYuX2OrkyawBnsgd1QQR/soC7surec
 yDDa7SMM8cOq3935bfzcYHV9FWJszEGIknchiGb9R3/T+vmSohbvDsM5zgwya9u/
 FHUIuTq324I6JWXKl30k4rwjiX9wQeMt+WZ5gC8KJYCWA296i2IpJwd0A45aaKuS
 x4bTjxqknE+fD4gQiMUSt+bmuOUAp81fEku3EPapCRYDPAj8f5uoY7R2arT/POwB
 b+s+AtXqzBymIqx1z0sZ/XcdZKmDuhdurGCWu7BfJFIzw5kQ2Qe3W8rUmrQ3pGut
 8a21YfilhUFiBv+B4wptfrzJuzU6Ps0BXHCnBsQjzvXwq5uFcZH495mM/4E4OJvh
 SbjL2K4iFj+O1ngFkukG/F8tdEM1zKBYy2ZEkGoWKUpyQanbAaGI6QKKJA+DCdBi
 yPEb7yRAa5KfLqMiocm1qCEO1I56HRiNHaJVMqCPOZxLmpXj19Fs71yIRplP1Trv
 GGXdWZsccjuY6OljoXWdEfnxAr5zBsO3Yf2yFT95AD+egtGsU1oOzlqAaU1mtflw
 ABo452pvh6FFpxGXqz6oK4VqY4Et7WgXOiljA4yIGoPpG/08L1Yle4eVc2EE01Jb
 +BL49xNJVeUhGFrvUjPGl9kVMeLmubPFbmgrtipW+VRg9W8+Yirw7DPP6K+gbPAR
 RzAUdZFbWw==
 =abJG
 -----END PGP SIGNATURE-----

Merge tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:
 "Good amount of cleanups and tech debt removals in here, and as a
  result, the diffstat shows a nice net reduction in code.

   - Softirq completion cleanups (Christoph)

   - Stop using ->queuedata (Christoph)

   - Cleanup bd claiming (Christoph)

   - Use check_events, moving away from the legacy media change
     (Christoph)

   - Use inode i_blkbits consistently (Christoph)

   - Remove old unused writeback congestion bits (Christoph)

   - Cleanup/unify submission path (Christoph)

   - Use bio_uninit consistently, instead of bio_disassociate_blkg
     (Christoph)

   - sbitmap cleared bits handling (John)

   - Request merging blktrace event addition (Jan)

   - sysfs add/remove race fixes (Luis)

   - blk-mq tag fixes/optimizations (Ming)

   - Duplicate words in comments (Randy)

   - Flush deferral cleanup (Yufen)

   - IO context locking/retry fixes (John)

   - struct_size() usage (Gustavo)

   - blk-iocost fixes (Chengming)

   - blk-cgroup IO stats fixes (Boris)

   - Various little fixes"

* tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits)
  block: blk-timeout: delete duplicated word
  block: blk-mq-sched: delete duplicated word
  block: blk-mq: delete duplicated word
  block: genhd: delete duplicated words
  block: elevator: delete duplicated word and fix typos
  block: bio: delete duplicated words
  block: bfq-iosched: fix duplicated word
  iocost_monitor: start from the oldest usage index
  iocost: Fix check condition of iocg abs_vdebt
  block: Remove callback typedefs for blk_mq_ops
  block: Use non _rcu version of list functions for tag_set_list
  blk-cgroup: show global disk stats in root cgroup io.stat
  blk-cgroup: make iostat functions visible to stat printing
  block: improve discard bio alignment in __blkdev_issue_discard()
  block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
  block: defer flush request no matter whether we have elevator
  block: make blk_timeout_init() static
  block: remove retry loop in ioc_release_fn()
  block: remove unnecessary ioc nested locking
  block: integrate bd_start_claiming into __blkdev_get
  ...
2020-08-03 11:57:03 -07:00
Mikulas Patocka 5df96f2b9f dm integrity: fix integrity recalculation that is improperly skipped
Commit adc0daad36 ("dm: report suspended
device during destroy") broke integrity recalculation.

The problem is dm_suspended() returns true not only during suspend,
but also during resume. So this race condition could occur:
1. dm_integrity_resume calls queue_work(ic->recalc_wq, &ic->recalc_work)
2. integrity_recalc (&ic->recalc_work) preempts the current thread
3. integrity_recalc calls if (unlikely(dm_suspended(ic->ti))) goto unlock_ret;
4. integrity_recalc exits and no recalculating is done.

To fix this race condition, add a function dm_post_suspending that is
only true during the postsuspend phase and use it instead of
dm_suspended().

Signed-off-by: Mikulas Patocka <mpatocka redhat com>
Fixes: adc0daad36 ("dm: report suspended device during destroy")
Cc: stable vger kernel org # v4.18+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-23 14:39:37 -04:00
Christoph Hellwig 21cf866145 writeback: remove bdi->congested_fn
Except for pktdvd, the only places setting congested bits are file
systems that allocate their own backing_dev_info structures.  And
pktdvd is a deprecated driver that isn't useful in stack setup
either.  So remove the dead congested_fn stacking infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>
[axboe: fixup unused variables in bcache/request.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-08 17:20:46 -06:00
Christoph Hellwig 8c911f3d4c writeback: remove struct bdi_writeback_congested
We never set any congested bits in the group writeback instances of it.
And for the simpler bdi-wide case a simple scalar field is all that
that is needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-08 17:05:53 -06:00
Mikulas Patocka 6958c1c640 dm: use noio when sending kobject event
kobject_uevent may allocate memory and it may be called while there are dm
devices suspended. The allocation may recurse into a suspended device,
causing a deadlock. We must set the noio flag when sending a uevent.

The observed deadlock was reported here:
https://www.redhat.com/archives/dm-devel/2020-March/msg00025.html

Reported-by: Khazhismel Kumykov <khazhy@google.com>
Reported-by: Tahsin Erdogan <tahsin@google.com>
Reported-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-08 12:50:51 -04:00
Christoph Hellwig 382761dc63 dm: use bio_uninit instead of bio_disassociate_blkg
bio_uninit is the proper API to clean up a BIO that has been allocated
on stack or inside a structure that doesn't come from the BIO allocator.
Switch dm to use that instead of bio_disassociate_blkg, which really is
an implementation detail.  Note that the bio_uninit calls are also moved
to the two callers of __send_empty_flush, so that they better pair with
the bio_init calls used to initialize them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-08 11:02:05 -04:00
Ming Lei 85067747cf dm: do not use waitqueue for request-based DM
Given request-based DM now uses blk-mq's blk_mq_queue_inflight() to
determine if outstanding IO has completed (and DM has no control over
the blk-mq state machine used to track outstanding IO) it is unsafe to
wakeup waiter (dm_wait_for_completion) before blk-mq has cleared a
request's state bits (e.g. MQ_RQ_IN_FLIGHT or MQ_RQ_COMPLETE).  As
such dm_wait_for_completion() could be left to wait indefinitely if no
other requests complete.

Fix this by eliminating request-based DM's use of waitqueue to wait
for blk-mq requests to complete in dm_wait_for_completion.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Depends-on: 3c94d83cb3 ("blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-07 12:00:01 -04:00
Jens Axboe b53ac8b891 dm: remove unused variable
Since merging the commit identified in Fixes below, we trigger this
compile time warning:

drivers/md/dm.c: In function ‘__map_bio’:
drivers/md/dm.c:1296:24: warning: unused variable ‘md’ [-Wunused-variable]
 1296 |  struct mapped_device *md = io->md;
       |                        ^~

Remove the 'md' variable.

Fixes: 5a6c35f9af ("block: remove direct_make_request")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 15:45:19 -06:00
Christoph Hellwig 5a6c35f9af block: remove direct_make_request
Now that submit_bio_noacct has a decent blk-mq fast path there is no
more need for this bypass.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 07:27:24 -06:00
Christoph Hellwig ed00aabd5e block: rename generic_make_request to submit_bio_noacct
generic_make_request has always been very confusingly misnamed, so rename
it to submit_bio_noacct to make it clear that it is submit_bio minus
accounting and a few checks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 07:27:24 -06:00
Christoph Hellwig c62b37d96b block: move ->make_request_fn to struct block_device_operations
The make_request_fn is a little weird in that it sits directly in
struct request_queue instead of an operation vector.  Replace it with
a block_device_operations method called submit_bio (which describes much
better what it does).  Also remove the request_queue argument to it, as
the queue can be derived pretty trivially from the bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 07:27:24 -06:00
Christoph Hellwig f695ca3886 block: remove the request_queue argument from blk_queue_split
The queue can be trivially derived from the bio, so pass one less
argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 07:27:23 -06:00
Christoph Hellwig c4a59c4e5d dm: stop using ->queuedata
Instead of setting up the queuedata as well just use one private data
field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 07:27:23 -06:00
Christoph Hellwig 4ef2c5c246 dm: use bio_uninit instead of bio_disassociate_blkg
bio_uninit is the proper API to clean up a BIO that has been allocated
on stack or inside a structure that doesn't come from the BIO allocator.
Switch dm to use that instead of bio_disassociate_blkg, which really is
an implementation detail.  Note that the bio_uninit calls are also moved
to the two callers of __send_empty_flush, so that they better pair with
the bio_init calls used to initialize them.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-29 09:09:07 -06:00
Johannes Thumshirn 415c79e13b dm: update original bio sector on Zone Append
Naohiro reported that issuing zone-append bios to a zoned block device
underneath a dm-linear device does not work as expected.

This because we forgot to reverse-map the sector the device wrote to the
original bio.

For zone-append bios, get the offset in the zone of the written sector
from the clone bio and add that to the original bio's sector position.

Fixes: 0512a75b98 ("block: Introduce REQ_OP_ZONE_APPEND")
Cc: stable@vger.kernel.org
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-06-19 12:25:58 -04:00
Linus Torvalds b25c6644bf - Largest change for this cycle is the DM zoned target's metadata
version 2 feature that adds support for pairing regular block
   devices with a zoned device to ease performance impact associated
   with finite random zones of zoned device.  Changes came in 3
   batches: first prepared for and then added the ability to pair a
   single regular block device, second was a batch of fixes to improve
   zoned's reclaim heuristic, third removed the limitation of only
   adding a single additional regular block device to allow many
   devices.  Testing has shown linear scaling as more devices are
   added.
 
 - Add new emulated block size (ebs) target that emulates a smaller
   logical_block_size than a block device supports.  Primary use-case
   is to emulate "512e" devices that have 512 byte logical_block_size
   and 4KB physical_block_size.  This is useful to some legacy
   applications otherwise wouldn't be ablee to be used on 4K devices
   because they depend on issuing IO in 512 byte granularity.
 
 - Add discard interfaces to DM bufio.  First consumer of the interface
   is the dm-ebs target that makes heavy use of dm-bufio.
 
 - Fix DM crypt's block queue_limits stacking to not truncate
   logic_block_size.
 
 - Add Documentation for DM integrity's status line.
 
 - Switch DMDEBUG from a compile time config option to instead use
   dynamic debug via pr_debug.
 
 - Fix DM multipath target's hueristic for how it manages
   "queue_if_no_path" state internally.  DM multipath now avoids
   disabling "queue_if_no_path" unless it is actually needed (e.g. in
   response to configure timeout or explicit "fail_if_no_path"
   message).  This fixes reports of spurious -EIO being reported back
   to userspace application during fault tolerance testing with an NVMe
   backend.  Added various dynamic DMDEBUG messages to assist with
   debugging queue_if_no_path in the future.
 
 - Add a new DM multipath "Historical Service Time" Path Selector.
 
 - Fix DM multipath's dm_blk_ioctl() to switch paths on IO error.
 
 - Improve DM writecache target performance by using explicit
   cache flushing for target's single-threaded usecase and a small
   cleanup to remove unnecessary test in persistent_memory_claim.
 
 - Other small cleanups in DM core, dm-persistent-data, and DM integrity.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl7alrgTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWl42B/9sBd+j60emy4Bliu/f3pd7SEkFrSXv
 K2jXicRFx4E5kO0aLK+fX65cOiq2vvLsDh8c++0TLXcD9q7oK0qxK9c8TCPq29Cx
 W2J2dwdjyyqqbr3/FZHYYM9KLOl5rsxJLygXwhJhQ2Gny44L7nVACrAXNzXIHJ3r
 f8xr+GLdF/jz7WLj8bwEDo3Bf8wvxDvl2ijqj7EceOhTutNE8xHQ6UcTPqTtozJy
 sNM8UQNk1L43DBAvXfrKZ+yQ5DYAKdXKJpV9C8qv5DEGbaEikuMrHddgO4KlDdp4
 VjPk9GSfPwGcJ4ecN8vgecZVGvh52ZU7OZ8qey/q0zqps74jeHTZQQyM
 =jRun
 -----END PGP SIGNATURE-----

Merge tag 'for-5.8/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - The largest change for this cycle is the DM zoned target's metadata
   version 2 feature that adds support for pairing regular block devices
   with a zoned device to ease the performance impact associated with
   finite random zones of zoned device.

   The changes came in three batches: the first prepared for and then
   added the ability to pair a single regular block device, the second
   was a batch of fixes to improve zoned's reclaim heuristic, and the
   third removed the limitation of only adding a single additional
   regular block device to allow many devices.

   Testing has shown linear scaling as more devices are added.

 - Add new emulated block size (ebs) target that emulates a smaller
   logical_block_size than a block device supports

   The primary use-case is to emulate "512e" devices that have 512 byte
   logical_block_size and 4KB physical_block_size. This is useful to
   some legacy applications that otherwise wouldn't be able to be used
   on 4K devices because they depend on issuing IO in 512 byte
   granularity.

 - Add discard interfaces to DM bufio. First consumer of the interface
   is the dm-ebs target that makes heavy use of dm-bufio.

 - Fix DM crypt's block queue_limits stacking to not truncate
   logic_block_size.

 - Add Documentation for DM integrity's status line.

 - Switch DMDEBUG from a compile time config option to instead use
   dynamic debug via pr_debug.

 - Fix DM multipath target's hueristic for how it manages
   "queue_if_no_path" state internally.

   DM multipath now avoids disabling "queue_if_no_path" unless it is
   actually needed (e.g. in response to configure timeout or explicit
   "fail_if_no_path" message).

   This fixes reports of spurious -EIO being reported back to userspace
   application during fault tolerance testing with an NVMe backend.
   Added various dynamic DMDEBUG messages to assist with debugging
   queue_if_no_path in the future.

 - Add a new DM multipath "Historical Service Time" Path Selector.

 - Fix DM multipath's dm_blk_ioctl() to switch paths on IO error.

 - Improve DM writecache target performance by using explicit cache
   flushing for target's single-threaded usecase and a small cleanup to
   remove unnecessary test in persistent_memory_claim.

 - Other small cleanups in DM core, dm-persistent-data, and DM
   integrity.

* tag 'for-5.8/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (62 commits)
  dm crypt: avoid truncating the logical block size
  dm mpath: add DM device name to Failing/Reinstating path log messages
  dm mpath: enhance queue_if_no_path debugging
  dm mpath: restrict queue_if_no_path state machine
  dm mpath: simplify __must_push_back
  dm zoned: check superblock location
  dm zoned: prefer full zones for reclaim
  dm zoned: select reclaim zone based on device index
  dm zoned: allocate zone by device index
  dm zoned: support arbitrary number of devices
  dm zoned: move random and sequential zones into struct dmz_dev
  dm zoned: per-device reclaim
  dm zoned: add metadata pointer to struct dmz_dev
  dm zoned: add device pointer to struct dm_zone
  dm zoned: allocate temporary superblock for tertiary devices
  dm zoned: convert to xarray
  dm zoned: add a 'reserved' zone flag
  dm zoned: improve logging messages for reclaim
  dm zoned: avoid unnecessary device recalulation for secondary superblock
  dm zoned: add debugging message for reading superblocks
  ...
2020-06-05 15:45:03 -07:00
Christoph Hellwig 86240d5b68 dm: use bio_{start,end}_io_acct
Switch dm to use the nicer bio accounting helpers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27 05:21:23 -06:00
Mike Snitzer ac75b09fc6 dm: use DMDEBUG macros now that they use pr_debug variants
Now that DMDEBUG uses pr_debug and DMDEBUG_LIMIT uses
pr_debug_ratelimited cleanup DM's 2 direct pr_debug callers to use
them to get the benefit of consistent DM_FMT formatting of debugging
messages.

While doing so, dm-mpath.c:dm_report_EIO() was switched over to using
DMDEBUG_LIMIT due to the potential for error handling floods in the IO
completion path.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-20 17:09:47 -04:00
Christoph Hellwig ac7c5675fa blk-mq: allow blk_mq_make_request to consume the q_usage_counter reference
blk_mq_make_request currently needs to grab an q_usage_counter
reference when allocating a request.  This is because the block layer
grabs one before calling blk_mq_make_request, but also releases it as
soon as blk_mq_make_request returns.  Remove the blk_queue_exit call
after blk_mq_make_request returns, and instead let it consume the
reference.  This works perfectly fine for the block layer caller, just
device mapper needs an extra reference as the old problem still
persists there.  Open code blk_queue_enter_live in device mapper,
as there should be no other callers and this allows better documenting
why we do a non-try get.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-19 09:34:29 -06:00
Gabriel Krisman Bertazi 087615bf3a dm mpath: pass IO start time to path selector
The HST path selector needs this information to perform path
prediction. For request-based mpath, struct request's io_start_time_ns
is used, while for bio-based, use the start_time stored in dm_io.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15 10:29:36 -04:00
Satya Tangirala a892c8d52c block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manages encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.

We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.

We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.

Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.

Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.

Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14 09:47:53 -06:00
Christoph Hellwig 8cf7961dab block: bypass ->make_request_fn for blk-mq drivers
Call blk_mq_make_request when no ->make_request_fn is set.  This is
safe now that blk_alloc_queue always sets up the pointer for make_request
based drivers.  This avoids an indirect call in the blk-mq driver I/O
fast path, which is rather expensive due to spectre mitigations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-25 09:45:44 -06:00
Linus Torvalds 9b06860d7c libnvdimm for 5.7
- Add support for region alignment configuration and enforcement to
   fix compatibility across architectures and PowerPC page size
   configurations.
 
 - Introduce 'zero_page_range' as a dax operation. This facilitates
   filesystem-dax operation without a block-device.
 
 - Introduce phys_to_target_node() to facilitate drivers that want to
   know resulting numa node if a given reserved address range was
   onlined.
 
 - Advertise a persistence-domain for of_pmem and papr_scm. The
   persistence domain indicates where cpu-store cycles need to reach in
   the platform-memory subsystem before the platform will consider them
   power-fail protected.
 
 - Promote numa_map_to_online_node() to a cross-kernel generic facility.
 
 - Save x86 numa information to allow for node-id lookups for reserved
   memory ranges, deploy that capability for the e820-pmem driver.
 
 - Pick up some miscellaneous minor fixes, that missed v5.6-final,
   including a some smatch reports in the ioctl path and some unit test
   compilation fixups.
 
 - Fixup some flexible-array declarations.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEf41QbsdZzFdA8EfZHtKRamZ9iAIFAl6LtIAACgkQHtKRamZ9
 iAIwRA/8CLVVuQpgHQ1tqK4h8CZPrISFXh7wy7uhocEU2xrDh6iGVnLztmoLRr2k
 5f8T9lRzreSAwIVL5DbGqP1pFncqIt9VMnKsFlaPMBGCBNR+hURY0iBCNjIT+jiq
 BOzLd52MR2rqJxeXGTMUbWrBrbmuj4mZPdmGVuFFe7GFRpoaVpCgOo+296eWa/ot
 gIOFUTonZY7STYjNvDok0TXCmiCFuJb+P+y5ldfCPShHvZhTiaF53jircja8vAjO
 G5dt8ixBKUK0rXRc4SEQsQhAZNcAFHb6Gy5lg4C2QzhTF374xTc9usJZNWbIE9iM
 5mipBYvjVuoY+XaCNZDkaRcJIy/jqB15O6l3QIWbZLGaK9m95YPp9LmkPFwd3JpO
 e3rO24ML471DxqB9iWIiJCNcBBocLOlnd6qAQTpppWDpGNbudwXvfsmKHmKIScSE
 x+IDCdscLmmm+WG2dLmLraWOVPu42xZFccoQCi4M3TTqfeB9pZ9XckFQ37zX62zG
 5t+7Ek+t1W4QVt/JQYVKH03XT15sqUpVknvx0Hl4Y5TtbDOkFLkO8RN0/HyExDef
 7iegS35kqTsM4EfZQ+9juKbI2JBAjHANcbj0V4dogqaRj6vr3akumBzUtuYqAofv
 qU3s9skmLsEemOJC+ns2PT8vl5dyIoeDfH0r2XvGWxYqolMqJpA=
 =sY4N
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm and dax updates from Dan Williams:
 "There were multiple touches outside of drivers/nvdimm/ this round to
  add cross arch compatibility to the devm_memremap_pages() interface,
  enhance numa information for persistent memory ranges, and add a
  zero_page_range() dax operation.

  This cycle I switched from the patchwork api to Konstantin's b4 script
  for collecting tags (from x86, PowerPC, filesystem, and device-mapper
  folks), and everything looks to have gone ok there. This has all
  appeared in -next with no reported issues.

  Summary:

   - Add support for region alignment configuration and enforcement to
     fix compatibility across architectures and PowerPC page size
     configurations.

   - Introduce 'zero_page_range' as a dax operation. This facilitates
     filesystem-dax operation without a block-device.

   - Introduce phys_to_target_node() to facilitate drivers that want to
     know resulting numa node if a given reserved address range was
     onlined.

   - Advertise a persistence-domain for of_pmem and papr_scm. The
     persistence domain indicates where cpu-store cycles need to reach
     in the platform-memory subsystem before the platform will consider
     them power-fail protected.

   - Promote numa_map_to_online_node() to a cross-kernel generic
     facility.

   - Save x86 numa information to allow for node-id lookups for reserved
     memory ranges, deploy that capability for the e820-pmem driver.

   - Pick up some miscellaneous minor fixes, that missed v5.6-final,
     including a some smatch reports in the ioctl path and some unit
     test compilation fixups.

   - Fixup some flexible-array declarations"

* tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (29 commits)
  dax: Move mandatory ->zero_page_range() check in alloc_dax()
  dax,iomap: Add helper dax_iomap_zero() to zero a range
  dax: Use new dax zero page method for zeroing a page
  dm,dax: Add dax zero_page_range operation
  s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver
  dax, pmem: Add a dax operation zero_page_range
  pmem: Add functions for reading/writing page to/from pmem
  libnvdimm: Update persistence domain value for of_pmem and papr_scm device
  tools/test/nvdimm: Fix out of tree build
  libnvdimm/region: Fix build error
  libnvdimm/region: Replace zero-length array with flexible-array member
  libnvdimm/label: Replace zero-length array with flexible-array member
  ACPI: NFIT: Replace zero-length array with flexible-array member
  libnvdimm/region: Introduce an 'align' attribute
  libnvdimm/region: Introduce NDD_LABELING
  libnvdimm/namespace: Enforce memremap_compat_align()
  libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid
  libnvdimm: Out of bounds read in __nd_ioctl()
  acpi/nfit: improve bounds checking for 'func'
  mm/memremap_pages: Introduce memremap_compat_align()
  ...
2020-04-08 21:03:40 -07:00
Linus Torvalds de3c913c6e - Fix excessive bio splitting that caused performance regressions.
- Fix logic bug in DM integrity discard support's integrity tag
   testing.
 
 - Fix DM integrity warning on ppc64le due to missing cast.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl6HbiETHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWiRFCACoiewv5jBvqzYvINy0FpgwLyU0SOOY
 11UCIzeLgQUPMZ4a5CUpPQuqSxKR3g7RAoD1EJZ1cyOnAJuk6A+VPkOLDioa4BEC
 uS7ifenihclFjcpWjQaKTNEhuTURYSjIwk1wqSwb7Fv+L2Uo+4XB8DvazBZdMq8+
 0EAO2CNl3r9Tkut0MRGr7DKrZa4QauaPsl7BvtzlLZbC3Nj2VncpwePYv9z7c7Ra
 g0zAq8IJlOPgDBUt4szbEAKiwGWQEiG6yxcyj5J3keU6mrHAg0mr08rxaapFkfSK
 F++3gTYjjLi8rQmczufvJ440t2kjK+b4dOfxMyjZccgjcoPxrxxkgaWq
 =Cj9X
 -----END PGP SIGNATURE-----

Merge tag 'for-5.7/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - Fix excessive bio splitting that caused performance regressions

 - Fix logic bug in DM integrity discard support's integrity tag testing

 - Fix DM integrity warning on ppc64le due to missing cast

* tag 'for-5.7/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm integrity: fix logic bug in integrity tag testing
  Revert "dm: always call blk_queue_split() in dm_process_bio()"
  dm integrity: fix ppc64le warning
2020-04-03 14:44:48 -07:00
Mike Snitzer 120c9257f5 Revert "dm: always call blk_queue_split() in dm_process_bio()"
This reverts commit effd58c95f.

blk_queue_split() is causing excessive IO splitting -- because
blk_max_size_offset() depends on 'chunk_sectors' limit being set and
if it isn't (as is the case for DM targets!) it falls back to
splitting on a 'max_sectors' boundary regardless of offset.

"Fix" this by reverting back to _not_ using blk_queue_split() in
dm_process_bio() for normal IO (reads and writes).  Long-term fix is
still TBD but it should focus on training blk_max_size_offset() to
call into a DM provided hook (to call DM's max_io_len()).

Test results from simple misaligned IO test on 4-way dm-striped device
with chunksize of 128K and stripesize of 512K:

xfs_io -d -c 'pread -b 2m 224s 4072s' /dev/mapper/stripe_dev

before this revert:

253,0   21        1     0.000000000  2206  Q   R 224 + 4072 [xfs_io]
253,0   21        2     0.000008267  2206  X   R 224 / 480 [xfs_io]
253,0   21        3     0.000010530  2206  X   R 224 / 256 [xfs_io]
253,0   21        4     0.000027022  2206  X   R 480 / 736 [xfs_io]
253,0   21        5     0.000028751  2206  X   R 480 / 512 [xfs_io]
253,0   21        6     0.000033323  2206  X   R 736 / 992 [xfs_io]
253,0   21        7     0.000035130  2206  X   R 736 / 768 [xfs_io]
253,0   21        8     0.000039146  2206  X   R 992 / 1248 [xfs_io]
253,0   21        9     0.000040734  2206  X   R 992 / 1024 [xfs_io]
253,0   21       10     0.000044694  2206  X   R 1248 / 1504 [xfs_io]
253,0   21       11     0.000046422  2206  X   R 1248 / 1280 [xfs_io]
253,0   21       12     0.000050376  2206  X   R 1504 / 1760 [xfs_io]
253,0   21       13     0.000051974  2206  X   R 1504 / 1536 [xfs_io]
253,0   21       14     0.000055881  2206  X   R 1760 / 2016 [xfs_io]
253,0   21       15     0.000057462  2206  X   R 1760 / 1792 [xfs_io]
253,0   21       16     0.000060999  2206  X   R 2016 / 2272 [xfs_io]
253,0   21       17     0.000062489  2206  X   R 2016 / 2048 [xfs_io]
253,0   21       18     0.000066133  2206  X   R 2272 / 2528 [xfs_io]
253,0   21       19     0.000067507  2206  X   R 2272 / 2304 [xfs_io]
253,0   21       20     0.000071136  2206  X   R 2528 / 2784 [xfs_io]
253,0   21       21     0.000072764  2206  X   R 2528 / 2560 [xfs_io]
253,0   21       22     0.000076185  2206  X   R 2784 / 3040 [xfs_io]
253,0   21       23     0.000077486  2206  X   R 2784 / 2816 [xfs_io]
253,0   21       24     0.000080885  2206  X   R 3040 / 3296 [xfs_io]
253,0   21       25     0.000082316  2206  X   R 3040 / 3072 [xfs_io]
253,0   21       26     0.000085788  2206  X   R 3296 / 3552 [xfs_io]
253,0   21       27     0.000087096  2206  X   R 3296 / 3328 [xfs_io]
253,0   21       28     0.000093469  2206  X   R 3552 / 3808 [xfs_io]
253,0   21       29     0.000095186  2206  X   R 3552 / 3584 [xfs_io]
253,0   21       30     0.000099228  2206  X   R 3808 / 4064 [xfs_io]
253,0   21       31     0.000101062  2206  X   R 3808 / 3840 [xfs_io]
253,0   21       32     0.000104956  2206  X   R 4064 / 4096 [xfs_io]
253,0   21       33     0.001138823     0  C   R 4096 + 200 [0]

after this revert:

253,0   18        1     0.000000000  4430  Q   R 224 + 3896 [xfs_io]
253,0   18        2     0.000018359  4430  X   R 224 / 256 [xfs_io]
253,0   18        3     0.000028898  4430  X   R 256 / 512 [xfs_io]
253,0   18        4     0.000033535  4430  X   R 512 / 768 [xfs_io]
253,0   18        5     0.000065684  4430  X   R 768 / 1024 [xfs_io]
253,0   18        6     0.000091695  4430  X   R 1024 / 1280 [xfs_io]
253,0   18        7     0.000098494  4430  X   R 1280 / 1536 [xfs_io]
253,0   18        8     0.000114069  4430  X   R 1536 / 1792 [xfs_io]
253,0   18        9     0.000129483  4430  X   R 1792 / 2048 [xfs_io]
253,0   18       10     0.000136759  4430  X   R 2048 / 2304 [xfs_io]
253,0   18       11     0.000152412  4430  X   R 2304 / 2560 [xfs_io]
253,0   18       12     0.000160758  4430  X   R 2560 / 2816 [xfs_io]
253,0   18       13     0.000183385  4430  X   R 2816 / 3072 [xfs_io]
253,0   18       14     0.000190797  4430  X   R 3072 / 3328 [xfs_io]
253,0   18       15     0.000197667  4430  X   R 3328 / 3584 [xfs_io]
253,0   18       16     0.000218751  4430  X   R 3584 / 3840 [xfs_io]
253,0   18       17     0.000226005  4430  X   R 3840 / 4096 [xfs_io]
253,0   18       18     0.000250404  4430  Q   R 4120 + 176 [xfs_io]
253,0   18       19     0.000847708     0  C   R 4096 + 24 [0]
253,0   18       20     0.000855783     0  C   R 4120 + 176 [0]

Fixes: effd58c95f ("dm: always call blk_queue_split() in dm_process_bio()")
Cc: stable@vger.kernel.org
Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Tested-by: Barry Marson <bmarson@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-04-03 11:32:19 -04:00
Vivek Goyal 4e4ced9379 dax: Move mandatory ->zero_page_range() check in alloc_dax()
zero_page_range() dax operation is mandatory for dax devices. Right now
that check happens in dax_zero_page_range() function. Dan thinks that's
too late and its better to do the check earlier in alloc_dax().

I also modified alloc_dax() to return pointer with error code in it in
case of failure. Right now it returns NULL and caller assumes failure
happened due to -ENOMEM. But with this ->zero_page_range() check, I
need to return -EINVAL instead.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://lore.kernel.org/r/20200401161125.GB9398@redhat.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-04-02 19:15:03 -07:00
Vivek Goyal cdf6cdcd3b dm,dax: Add dax zero_page_range operation
This patch adds support for dax zero_page_range operation to dm targets.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20200228163456.1587-5-vgoyal@redhat.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-04-02 19:15:03 -07:00
Christoph Hellwig 3d745ea5b0 block: simplify queue allocation
Current make_request based drivers use either blk_alloc_queue_node or
blk_alloc_queue to allocate a queue, and then set up the make_request_fn
function pointer and a few parameters using the blk_queue_make_request
helper.  Simplify this by passing the make_request pointer to
blk_alloc_queue, and while at it merge the _node variant into the main
helper by always passing a node_id, and remove the superfluous gfp_mask
parameter.  A lower-level __blk_alloc_queue is kept for the blk-mq case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-27 10:23:43 -06:00
Christoph Hellwig c6a564ffad block: move the part_stat* helpers from genhd.h to a new header
These macros are just used by a few files.  Move them out of genhd.h,
which is included everywhere into a new standalone header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25 09:50:09 -06:00
Hou Tao 974f51e863 dm: fix congested_fn for request-based device
We neither assign congested_fn for requested-based blk-mq device nor
implement it correctly. So fix both.

Also, remove incorrect comment from dm_init_normal_md_queue and rename
it to dm_init_congested_fn.

Fixes: 4aa9c692e0 ("bdi: separate out congested state into a separate struct")
Cc: stable@vger.kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-03-03 11:10:20 -05:00
Mikulas Patocka adc0daad36 dm: report suspended device during destroy
The function dm_suspended returns true if the target is suspended.
However, when the target is being suspended during unload, it returns
false.

An example where this is a problem: the test "!dm_suspended(wc->ti)" in
writecache_writeback is not sufficient, because dm_suspended returns
zero while writecache_suspend is in progress.  As is, without an
enhanced dm_suspended, simply switching from flush_workqueue to
drain_workqueue still emits warnings:
workqueue writecache-writeback: drain_workqueue() isn't complete after 10 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 100 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 200 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 300 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 400 tries

writecache_suspend calls flush_workqueue(wc->writeback_wq) - this function
flushes the current work. However, the workqueue may re-queue itself and
flush_workqueue doesn't wait for re-queued works to finish. Because of
this - the function writecache_writeback continues execution after the
device was suspended and then concurrently with writecache_dtr, causing
a crash in writecache_writeback.

We must use drain_workqueue - that waits until the work and all re-queued
works finish.

As a prereq for switching to drain_workqueue, this commit fixes
dm_suspended to return true after the presuspend hook and before the
postsuspend hook - just like during a normal suspend. It allows
simplifying the dm-integrity and dm-writecache targets so that they
don't have to maintain suspended flags on their own.

With this change use of drain_workqueue() can be used effectively.  This
change was tested with the lvm2 testsuite and cryptsetup testsuite and
the are no regressions.

Fixes: 48debafe4f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18+
Reported-by: Corey Marthaler <cmarthal@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-02-27 16:40:58 -05:00
Mike Snitzer 47ace7e012 dm: fix potential for q->make_request_fn NULL pointer
Move blk_queue_make_request() to dm.c:alloc_dev() so that
q->make_request_fn is never NULL during the lifetime of a DM device
(even one that is created without a DM table).

Otherwise generic_make_request() will crash simply by doing:
  dmsetup create -n test
  mount /dev/dm-N /mnt

While at it, move ->congested_data initialization out of
dm.c:alloc_dev() and into the bio-based specific init method.

Reported-by: Stefan Bader <stefan.bader@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1860231
Fixes: ff36ab3458 ("dm: remove request-based logic from make_request_fn wrapper")
Depends-on: c12c9a3c38 ("dm: various cleanups to md->queue initialization code")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-01-27 14:52:36 -05:00
Christoph Hellwig d41003513e block: rework zone reporting
Avoid the need to allocate a potentially large array of struct blk_zone
in the block layer by switching the ->report_zones method interface to
a callback model. Now the caller simply supplies a callback that is
executed on each reported zone, and private data for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12 19:12:07 -07:00
Damien Le Moal 5eac3eb30c block: Remove partition support for zoned block devices
No known partitioning tool supports zoned block devices, especially the
host managed flavor with strong sequential write constraints.
Furthermore, there are also no known user nor use cases for partitioned
zoned block devices.

This patch removes partition device creation for zoned block devices,
which allows simplifying the processing of zone commands for zoned
block devices. A warning is added if a partition table is found on the
device.

For report zones operations no zone sector information remapping is
necessary anymore, simplifying the code. Of note is that remapping of
zone reports for DM targets is still necessary as done by
dm_remap_zone_report().

Similarly, remaping of a zone reset bio is not necessary anymore.
Testing for the applicability of the zone reset all request also becomes
simpler and only needs to check that the number of sectors of the
requested zone range is equal to the disk capacity.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12 19:11:57 -07:00
Damien Le Moal ceeb373aa6 block: Simplify report zones execution
All kernel users of blkdev_report_zones() as well as applications use
through ioctl(BLKZONEREPORT) expect to potentially get less zone
descriptors than requested. As such, the use of the internal report
zones command execution loop implemented by blk_report_zones() is
not necessary and can even be harmful to performance by causing the
execution of inefficient small zones report command to service the
reminder of a requested zone array.

This patch removes blk_report_zones(), simplifying the code. Also
remove a now incorrect comment in dm_blk_report_zones().

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Javier Gonzalez <javier@javigon.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12 19:11:56 -07:00
Ajay Joshi 2e2d6f7e44 dm: add zone open, close and finish support
Implement REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH
support to allow explicit control of zone states.

Contains contributions from Matias Bjorling, Hans Holmberg and
Damien Le Moal.

Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07 06:36:08 -07:00
Mikulas Patocka 123d87d553 dm: make dm_table_find_target return NULL
Currently, if we pass too high sector number to dm_table_find_target, it
returns zeroed dm_target structure and callers test if the structure is
zeroed with the macro dm_target_is_valid.

However, returning NULL is common practice to indicate errors.

This patch refactors the dm code, so that dm_table_find_target returns
NULL and its callers test the returned value for NULL. The macro
dm_target_is_valid is deleted. In alloc_targets, we no longer allocate an
extra zeroed target.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-23 10:13:12 -04:00
Linus Torvalds f8c3500cd1 - virtio_pmem: The new virtio_pmem facility introduces a paravirtualized
persistent memory device that allows a guest VM to use DAX mechanisms to
   access a host-file with host-page-cache. It arranges for MAP_SYNC to
   be disabled and instead triggers a host fsync() when a 'write-cache
   flush' command is sent to the virtual disk device.
 
 - Miscellaneous small fixups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJdMHwpAAoJEB7SkWpmfYgCUYoP/3vcgYBAaXNksyALF0iowPoP
 z4J0KoaOA1CzRFEQtCWUQa84CWj+XoSewwSeyrIkqKQvx/gghXblK+GVjVzBn0BD
 hmmiKr8af4DdxfzYdEXJp65cCpIiVMaJiGr20Aj9ObwvWJb4QZbz9q7hnPt6KgiI
 jVND3BpP3OERb4ZFcibdmJT5foKooMcXVG6+luVe+hc1+ZZQxJBsBaqie4brQIFq
 j59NX3HfHH2fr1vVwnVH0CO4tgbgYg9wZ2EivGu6wBWvORjrr7KiSSbOYP68EBtd
 lUoNps+vQtGnfXGwNzAjp1wuknrQYYh4/KMKjep7hiZD39rgyvBpbHbyynKzQCWV
 REe8cXr/nwphsENvBAUBiqY999EWVIxdT2iaVaSA6K/31JQAC5AFyxVK/P2Ke1SK
 rvePZ++iLQ1o4phTxQPNlVUqF9jOrFVVICGwMDqaqSkOsD9YKQdFClfOF/1ntlDz
 V0bs+Y0Pe8AJCd9ESep4X+vHAWRRIb4EQIuwLaX8RJoY+r1fGye9RPthpYYzvXKp
 DI2iJztFO3anzj2i9htNPUFIaiUmIhzEvG32O2If2yc5FL02hMpHPoFx6vHhe6s3
 f8OJ+olsJK+/IIrV8+DHqYvhzylOYIhmRTvIxIxaNDPHkhR1i2RDQ6KKK1YZmsr8
 MjAZ+Ym0GadDivs+wcM6
 =uAMG
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "Primarily just the virtio_pmem driver:

   - virtio_pmem

     The new virtio_pmem facility introduces a paravirtualized
     persistent memory device that allows a guest VM to use DAX
     mechanisms to access a host-file with host-page-cache. It arranges
     for MAP_SYNC to be disabled and instead triggers a host fsync()
     when a 'write-cache flush' command is sent to the virtual disk
     device.

   - Miscellaneous small fixups"

* tag 'libnvdimm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  virtio_pmem: fix sparse warning
  xfs: disable map_sync for async flush
  ext4: disable map_sync for async flush
  dax: check synchronous mapping is supported
  dm: enable synchronous dax
  libnvdimm: add dax_dev sync flag
  virtio-pmem: Add virtio pmem driver
  libnvdimm: nd_region flush callback support
  libnvdimm, namespace: Drop uuid_t implementation detail
2019-07-18 10:52:08 -07:00
Damien Le Moal bd976e5272 block: Kill gfp_t argument of blkdev_report_zones()
Only GFP_KERNEL and GFP_NOIO are used with blkdev_report_zones(). In
preparation of using vmalloc() for large report buffer and zone array
allocations used by this function, remove its "gfp_t gfp_mask" argument
and rely on the caller context to use memalloc_noio_save/restore() where
necessary (block layer zone revalidation and dm-zoned I/O error path).

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-11 20:04:37 -06:00
Pankaj Gupta 2e9ee0955d dm: enable synchronous dax
This patch sets dax device 'DAXDEV_SYNC' flag if all the target
devices of device mapper support synchrononous DAX. If device
mapper consists of both synchronous and asynchronous dax devices,
we don't set 'DAXDEV_SYNC' flag.

'dm_table_supports_dax' is refactored to pass 'iterate_devices_fn'
as argument so that the callers can pass the appropriate functions.

Suggested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-07-05 15:19:10 -07:00
Pankaj Gupta fefc1d97fa libnvdimm: add dax_dev sync flag
This patch adds 'DAXDEV_SYNC' flag which is set
for nd_region doing synchronous flush. This later
is used to disable MAP_SYNC functionality for
ext4 & xfs filesystem for devices don't support
synchronous flush.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-07-05 15:19:10 -07:00
Linus Torvalds b2ad81363f libnvdimm fixes v5.2-rc2
- Fix a regression that disabled device-mapper dax support
 
 - Remove unnecessary hardened-user-copy overhead (>30%) for dax
   read(2)/write(2).
 
 - Fix some compilation warnings.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJc6WWQAAoJEB7SkWpmfYgCpVwP/0Vfq/3ChbH5T7s4x2MkpLX+
 metYwCyzPJK32mVMbAmizGWEBn8Np+eZcU7jvKYpDXJLWdbUUz4oZD04RYmgkYp7
 SHmjn9VdpfMSziWUx6zrrbyAtBq04x7GT7IIkCzlGIuNVCYqXBnRSVGz06tDFEEd
 pU9HtZr32C425pdFK5D4sorJED2JKG7CwLPdSVHayuyHmg7jp78T7U5Y31WgOhSw
 +JF6UwQIJ+UPg30PYBPG32Zmh8E7Fv/AaYF3JGbp4xRS+B/xbakZhJtYuBzWRjlp
 BlwUg9nUaVgEnjE9KpTcJk8VlXDz6ZjpYXXdY4Hv5g+PPWm5kdZBhPYjaymrtI3o
 7DjtKmNd4F5qhU06oTXtFoBbgoiOBM7fOqsyVZ6tsNguVojlt8lnUvkTKqvznw4n
 K4TGzi0Zgu511umMumF1Q/d0BlNXz+gptcC4qwuEUyQa7sEPSWSfcC66SvY/Y5ym
 VGG4roO3Jz6p3JniuFEXakifzU57vPPv7OxGD3d0PKUSDHVU5yPjWRpJju8wJeVW
 DmTZ+SBo2Q/YP9vDlULPqxGJNkP31SaRg/9PnB8W1z2yqyuA+Pjv+Qjt1X618PFq
 1c2+ufeJoOb1Zc3k6Jw1bovilpb2GDW+4QucC3J0/zFtK00PYcGyyqo3jWlUgINf
 QWPgwBIW/yFcb7xOazFS
 =nko1
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-fixes-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm fixes from Dan Williams:

 - Fix a regression that disabled device-mapper dax support

 - Remove unnecessary hardened-user-copy overhead (>30%) for dax
   read(2)/write(2).

 - Fix some compilation warnings.

* tag 'libnvdimm-fixes-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
  dax: Arrange for dax_supported check to span multiple devices
  libnvdimm: Fix compilation warnings with W=1
2019-05-25 10:11:23 -07:00
Michael Lass 51b86f9a8d dm: make sure to obey max_io_len_target_boundary
Commit 61697a6abd ("dm: eliminate 'split_discard_bios' flag from DM
target interface") incorrectly removed code from
__send_changing_extent_only() that is required to impose a per-target IO
boundary on IO that exceeds max_io_len_target_boundary().  Otherwise
"special" IO (e.g. DISCARD, WRITE SAME, WRITE ZEROES) can write beyond
where allowed.

Fix this by restoring the max_io_len_target_boundary() limit in
__send_changing_extent_only()

Fixes: 61697a6abd ("dm: eliminate 'split_discard_bios' flag from DM target interface")
Cc: stable@vger.kernel.org # 5.1+
Signed-off-by: Michael Lass <bevan@bi-co.net>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-21 19:15:20 -04:00
Dan Williams 7bf7eac8d6 dax: Arrange for dax_supported check to span multiple devices
Pankaj reports that starting with commit ad428cdb52 "dax: Check the
end of the block-device capacity with dax_direct_access()" device-mapper
no longer allows dax operation. This results from the stricter checks in
__bdev_dax_supported() that validate that the start and end of a
block-device map to the same 'pagemap' instance.

Teach the dax-core and device-mapper to validate the 'pagemap' on a
per-target basis. This is accomplished by refactoring the
bdev_dax_supported() internals into generic_fsdax_supported() which
takes a sector range to validate. Consequently generic_fsdax_supported()
is suitable to be used in a device-mapper ->iterate_devices() callback.
A new ->dax_supported() operation is added to allow composite devices to
split and route upper-level bdev_dax_supported() requests.

Fixes: ad428cdb52 ("dax: Check the end of the block-device...")
Cc: <stable@vger.kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Tested-by: Pankaj Gupta <pagupta@redhat.com>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-05-20 15:02:08 -07:00
Sheetal Singala 8454fca4f5 dm: fix a couple brace coding style issues
Signed-off-by: Sheetal Singala <2396sheetal@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-16 10:09:21 -04:00
Peng Wang 514cf4f881 dm: only initialize md->dax_dev if CONFIG_DAX_DRIVER is enabled
md->dax_dev defaults to NULL and there is no need to initialize it
if CONFIG_DAX_DRIVER is disabled.

Signed-off-by: Peng Wang <rocking@whu.edu.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-26 11:28:17 -04:00
Mike Snitzer bcb44433bb dm: disable DISCARD if the underlying storage no longer supports it
Storage devices which report supporting discard commands like
WRITE_SAME_16 with unmap, but reject discard commands sent to the
storage device.  This is a clear storage firmware bug but it doesn't
change the fact that should a program cause discards to be sent to a
multipath device layered on this buggy storage, all paths can end up
failed at the same time from the discards, causing possible I/O loss.

The first discard to a path will fail with Illegal Request, Invalid
field in cdb, e.g.:
 kernel: sd 8:0:8:19: [sdfn] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
 kernel: sd 8:0:8:19: [sdfn] tag#0 Sense Key : Illegal Request [current]
 kernel: sd 8:0:8:19: [sdfn] tag#0 Add. Sense: Invalid field in cdb
 kernel: sd 8:0:8:19: [sdfn] tag#0 CDB: Write same(16) 93 08 00 00 00 00 00 a0 08 00 00 00 80 00 00 00
 kernel: blk_update_request: critical target error, dev sdfn, sector 10487808

The SCSI layer converts this to the BLK_STS_TARGET error number, the sd
device disables its support for discard on this path, and because of the
BLK_STS_TARGET error multipath fails the discard without failing any
path or retrying down a different path.  But subsequent discards can
cause path failures.  Any discards sent to the path which already failed
a discard ends up failing with EIO from blk_cloned_rq_check_limits with
an "over max size limit" error since the discard limit was set to 0 by
the sd driver for the path.  As the error is EIO, this now fails the
path and multipath tries to send the discard down the next path.  This
cycle continues as discards are sent until all paths fail.

Fix this by training DM core to disable DISCARD if the underlying
storage already did so.

Also, fix branching in dm_done() and clone_endio() to reflect the
mutually exclussive nature of the IO operations in question.

Cc: stable@vger.kernel.org
Reported-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-04 15:33:59 -04:00