Commit Graph

358 Commits

Author SHA1 Message Date
Christoph Hellwig 29ff57c610 block: move the start_sect field to struct block_device
Move the start_sect field to struct block_device in preparation
of killing struct hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig 15e3d2c5cd block: move disk stat accounting to struct block_device
Move the dkstats and stamp field to struct block_device in preparation
of killing struct hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig a782483cc1 block: remove the nr_sects field in struct hd_struct
Now that the hd_struct always has a block device attached to it, there is
no need for having two size field that just get out of sync.

Additionally the field in hd_struct did not use proper serialization,
possibly allowing for torn writes.  By only using the block_device field
this problem also gets fixed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Coly Li <colyli@suse.de>			[bcache]
Acked-by: Chao Yu <yuchao0@huawei.com>			[f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig 22ae8ce8b8 block: simplify bdev/disk lookup in blkdev_get
To simplify block device lookup and a few other upcoming areas, make sure
that we always have a struct block_device available for each disk and
each partition, and only find existing block devices in bdget.  The only
downside of this is that each device and partition uses a little more
memory.  The upside will be that a lot of code can be simplified.

With that all we need to look up the block device is to lookup the inode
and do a few sanity checks on the gendisk, instead of the separate lookup
for the gendisk.  For blk-cgroup which wants to access a gendisk without
opening it, a new blkdev_{get,put}_no_open low-level interface is added
to replace the previous get_gendisk use.

Note that the change to look up block device directly instead of the two
step lookup using struct gendisk causes a subtile change in behavior:
accessing a non-existing partition on an existing block device can now
cause a call to request_module.  That call is harmless, and in practice
no recent system will access these nodes as they aren't created by udev
and static /dev/ setups are unusual.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:39 -07:00
Christoph Hellwig efdc41c8d4 block: use put_device in put_disk
Use put_device to put the device instead of poking into the internals
and using kobject_put.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:39 -07:00
Christoph Hellwig e79319af6d block: use disk_part_iter_exit in disk_part_iter_next
Call disk_part_iter_exit in disk_part_iter_next instead of duplicating
the functionality.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:39 -07:00
Christoph Hellwig 449f4ec989 block: remove the update_bdev parameter to set_capacity_revalidate_and_notify
The update_bdev argument is always set to true, so remove it.  Also
rename the function to the slighly less verbose set_capacity_and_notify,
as propagating the disk size to the block device isn't really
revalidation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16 08:34:14 -07:00
Christoph Hellwig e2b6b30187 block: fix the kerneldoc comment for __register_blkdev
Switch the comment to talk about __register_blkdev instead of
register_blkdev and document the new probe parameter.

Fixes: 3da1a61e7046 ("block: add an optional probe callback to major_names")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16 08:14:31 -07:00
Christoph Hellwig e418de3abc block: switch gendisk lookup to a simple xarray
Now that bdev_map is only used for finding gendisks, we can use
a simple xarray instead of the regions tracking structure for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16 08:14:31 -07:00
Christoph Hellwig a160c6159d block: add an optional probe callback to major_names
Add a callback to the major_names array that allows a driver to override
how to probe for dev_t that doesn't currently have a gendisk registered.
This will help separating the lookup of the gendisk by dev_t vs probe
action for a not currently registered dev_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16 08:14:30 -07:00
Christoph Hellwig bd8eff3ba2 block: rework requesting modules for unclaimed devices
Instead of reusing the ranges in bdev_map, add a new helper that is
called if no ranges was found.  This is a first step to unpeel and
eventually remove the complex ranges structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16 08:14:30 -07:00
Christoph Hellwig e49fbbbf0a block: split block_class_lock
Split the block_class_lock mutex into one each to protect bdev_map
and major_names.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16 08:14:30 -07:00
Christoph Hellwig 62b508f8b6 block: open code kobj_map into in block/genhd.c
Copy and paste the kobj_map functionality in the block code in preparation
for completely rewriting it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16 08:14:30 -07:00
Christoph Hellwig 6b3ba9762f block: cleanup del_gendisk a bit
Merge three hidden gendisk checks into one.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16 08:14:29 -07:00
Christoph Hellwig 98f49b63e8 block: remove set_device_ro
Fold set_device_ro into its only remaining caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16 08:14:29 -07:00
Christoph Hellwig 7e890c37c2 block: add a return value to set_capacity_revalidate_and_notify
Return if the function ended up sending an uevent or not.

Cc: stable@vger.kernel.org # v5.9
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-12 13:59:04 -07:00
Christoph Hellwig 10ed16662d block: add a bdget_part helper
All remaining callers of bdget() outside of fs/block_dev.c want to get a
reference to the struct block_device for a given struct hd_struct.  Add
a helper just for that and then mark bdget static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-05 10:38:33 -06:00
Christoph Hellwig 8a63a86e1f block: use bd_partno in bdevname
No need to go through the hd_struct to find the partition number.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 08:18:58 -06:00
Christoph Hellwig 9301fe7343 block: cleanup partition scanning in register_disk
Use blkdev_get_by_dev instead of open coding it using bdget_disk +
blkdev_get, and split the code to read the partition table into a
separate helper to make it a little more obvious.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23 10:43:19 -06:00
Christoph Hellwig 38430f0876 block: move the NEED_PART_SCAN flag to struct gendisk
We can only scan for partitions on the whole disk, so move the flag
from struct block_device to struct gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23 10:43:18 -06:00
Christoph Hellwig 95f6f3a46f block: add a bdev_check_media_change helper
Like check_disk_changed, except that it does not call ->revalidate_disk
but leaves that to the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-10 09:32:30 -06:00
Christoph Hellwig b8086d3f5a block: use revalidate_disk_size in set_capacity_revalidate_and_notify
Only virtio_blk and xen-blkfront set the revalidate argument to true,
and both do not implement the ->revalidate_disk method.  So switch
to the helper that just updates the size instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02 08:00:07 -06:00
Christoph Hellwig f4ad06f2bb block: rename bd_invalidated
Replace bd_invalidate with a new BDEV_NEED_PART_SCAN flag in a bd_flags
variable to better describe the condition.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02 08:00:02 -06:00
Christoph Hellwig 1f06959bd2 block: remove the unused q argument to part_in_flight and part_in_flight_rw
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:31 -06:00
Christoph Hellwig 8328eb2836 block: remove the disk argument to delete_partition
We can trivially derive the gendisk from the hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:25 -06:00
Christoph Hellwig f93af2a494 block: cleanup __alloc_disk_node
Use early returns and goto-based unwinding to simplify the flow a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 16:49:26 -06:00
Randy Dunlap 0d20dcc277 block: genhd: delete duplicated words
Drop the repeated word "to" in multiple places.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-31 16:29:47 -06:00
Boris Burkov ef45fe470e blk-cgroup: show global disk stats in root cgroup io.stat
In order to improve consistency and usability in cgroup stat accounting,
we would like to support the root cgroup's io.stat.

Since the root cgroup has processes doing io even if the system has no
explicitly created cgroups, we need to be careful to avoid overhead in
that case.  For that reason, the rstat algorithms don't handle the root
cgroup, so just turning the file on wouldn't give correct statistics.

To get around this, we simulate flushing the iostat struct by filling it
out directly from global disk stats. The result is a root cgroup io.stat
file consistent with both /proc/diskstats and io.stat.

Note that in order to collect the disk stats, we needed to iterate over
devices. To facilitate that, we had to change the linkage of a disk_type
to external so that it can be used from blk-cgroup.c to iterate over
disks.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Boris Burkov <boris@bur.io>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-17 20:18:00 -06:00
Christoph Hellwig a564e23f0f md: switch to ->check_events for media change notifications
md is the last driver using the legacy media_changed method.  Switch
it over to (not so) new ->clear_events approach, which also removes the
need for the ->revalidate_disk method.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[axboe: remove unused 'bdops' variable in disk_clear_events()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-08 16:19:47 -06:00
Luis Chamberlain e8c7d14ac6 block: revert back to synchronous request_queue removal
Commit dc9edc44de ("block: Fix a blk_exit_rl() regression") merged on
v4.12 moved the work behind blk_release_queue() into a workqueue after a
splat floated around which indicated some work on blk_release_queue()
could sleep in blk_exit_rl(). This splat would be possible when a driver
called blk_put_queue() or blk_cleanup_queue() (which calls blk_put_queue()
as its final call) from an atomic context.

blk_put_queue() decrements the refcount for the request_queue kobject, and
upon reaching 0 blk_release_queue() is called. Although blk_exit_rl() is
now removed through commit db6d995235 ("block: remove request_list code")
on v5.0, we reserve the right to be able to sleep within
blk_release_queue() context.

The last reference for the request_queue must not be called from atomic
context. *When* the last reference to the request_queue reaches 0 varies,
and so let's take the opportunity to document when that is expected to
happen and also document the context of the related calls as best as
possible so we can avoid future issues, and with the hopes that the
synchronous request_queue removal sticks.

We revert back to synchronous request_queue removal because asynchronous
removal creates a regression with expected userspace interaction with
several drivers. An example is when removing the loopback driver, one
uses ioctls from userspace to do so, but upon return and if successful,
one expects the device to be removed. Likewise if one races to add another
device the new one may not be added as it is still being removed. This was
expected behavior before and it now fails as the device is still present
and busy still. Moving to asynchronous request_queue removal could have
broken many scripts which relied on the removal to have been completed if
there was no error. Document this expectation as well so that this
doesn't regress userspace again.

Using asynchronous request_queue removal however has helped us find
other bugs. In the future we can test what could break with this
arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE.

While at it, update the docs with the context expectations for the
request_queue / gendisk refcount decrement, and make these
expectations explicit by using might_sleep().

Fixes: dc9edc44de ("block: Fix a blk_exit_rl() regression")
Suggested-by: Nicolai Stange <nstange@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: yu kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:15:58 -06:00
Luis Chamberlain 763b58923a block: clarify context for refcount increment helpers
Let us clarify the context under which the helpers to increment the
refcount for the gendisk and request_queue can be called under. We
make this explicit on the places where we may sleep with might_sleep().

We don't address the decrement context yet, as that needs some extra
work and fixes, but will be addressed in the next patch.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:15:58 -06:00
Luis Chamberlain b5bd357cf8 block: add docs for gendisk / request_queue refcount helpers
This adds documentation for the gendisk / request_queue refcount
helpers.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:15:57 -06:00
Konstantin Khlebnikov 8ab1d40a64 block: remove rcu_read_lock() from part_stat_lock()
The RCU lock is required only in disk_map_sector_rcu() to lookup the
partition.  After that request holds reference to related hd_struct.

Replace get_cpu() with preempt_disable() - returned cpu index is unused.

[hch: rebased]

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27 05:21:23 -06:00
Christoph Hellwig 58d4f14fc3 block: always use a percpu variable for disk stats
percpu variables have a perfectly fine working stub implementation
for UP kernels, so use that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27 05:21:23 -06:00
Christoph Hellwig 10ec5e86f9 block: merge part_{inc,dev}_in_flight into their only callers
part_inc_in_flight and part_dec_in_flight only have one caller each, and
those callers are purely for bio based drivers.  Merge each function into
the only caller, and remove the superflous blk-mq checks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-19 09:35:24 -06:00
Christoph Hellwig b2f609e191 block: move the blk-mq calls out of part_in_flight{,_rw}
Don't bother to call part_in_flight / part_in_flight_rw on blk-mq
devices, just call the blk-mq versions directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-19 09:35:24 -06:00
Ming Lei 27eb3af9a3 block: don't hold part0's refcount in IO path
gendisk can't be gone when there is IO activity, so not hold
part0's refcount in IO path.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Cc: Yufen Yu <yuyufen@huawei.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-12 20:31:40 -06:00
Ming Lei 07c4e1e834 block: only define 'nr_sects_seq' in hd_part for 32bit SMP
The seqcount of 'nr_sects_seq' is only needed in case of 32bit SMP,
so define it just for 32bit SMP.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Cc: Yufen Yu <yuyufen@huawei.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-12 20:31:39 -06:00
Ming Lei b7d6c30333 block: fix use-after-free on cached last_lookup partition
delete_partition() clears the cached last_lookup partition. However the
.last_lookup cache may be overwritten by one IO path after it is cleared
from delete_partition(). Then another IO path may use the cached deleting
partition after hd_struct_free() is called, then use-after-free is triggered
on the cached partition.

Fixes the issue by the following approach:

1) always get the partition's refcount via hd_struct_try_get() before
setting .last_lookup

2) move clearing .last_lookup from delete_partition() to hd_struct_free()
which is the release handle of the partition's percpu-refcount, so that no
IO path can cache deleteing partition via .last_lookup.

It is one candidate approach of Yufen's patch[1] which adds overhead
in fast path by indirect lookup which may introduce one extra cacheline
in IO path. Also this patch relies on percpu-refcount's protection, and
it is easier to understand and verify.

[1] https://lore.kernel.org/linux-block/20200109013551.GB9655@ming.t460p/T/#t

Reported-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-12 20:31:39 -06:00
Christoph Hellwig 3c5d202b55 bdi: remove bdi_register_owner
Split out a new bdi_set_owner helper to set the owner, and move the policy
for creating the bdi name back into genhd.c, where it belongs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:15:13 -06:00
Christoph Hellwig 9bc5c397d8 block: fold bdev_unhash_inode into invalidate_partition
invalidate_partition and bdev_unhash_inode are always paired, and
invalidate_partition already does an icache lookup for the block device
inode.  Piggy back on that to remove the inode from the hash.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-20 11:33:00 -06:00
Christoph Hellwig 02d33b6771 block: mark invalidate_partition static
invalidate_partition is only used in genhd.c, so mark it static.  Also
drop the return value given that is is always ignored.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-20 11:32:59 -06:00
Christoph Hellwig cddae808ae block: pass a hd_struct to delete_partition
All callers have the hd_struct at hand, so pass it instead of performing
another lookup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-20 11:32:59 -06:00
Linus Torvalds 10f36b1e80 for-5.7/block-2020-03-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6BJCoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvziEACqQC+QRKiqR6X5yaPWJ9LqjKE7lfI1PUb7
 0a1z1mKuf8d6z0qNleUwdSOEaS5zJiswou2K8GLvEtTQH41QYsQkxc9GLjAyTveK
 szAyzZaa3BNUy9hkczm9i2arv3fI8XoTE3JvRM0e9wL8fBJDYCtKtHFJvF4hisOQ
 ydaJlU6tcwzd9bdV7K5dLwBxu3AeAJjzS3Tyfw25u9N9O/btUxJ91RTqBb2+Xeoz
 AVasfRlAqf/CzdjxCCmDgWE2QM4852pAeQ7UJJBGISNWNoiwkezMg+6HD0jEOLee
 bQ8uDyQdihIWTY+/zQasotX8/71uLV8QgtjWLXR9zrjrubIBWHGzoWSQ4kPg5DfQ
 bJmKO0VvWN2sshZEpWvzzAFGYxZViNphbK2Pb4hKOcv7jtMcC8mmEogh/7EqbD/n
 KB3IM9qVoXM8INm5o0dTy5uDRJxiHiHYkqsZaKz55BB/R4Geym5TINT3nXgxhQrn
 JoSwp4zdm3/NJOySruDi2eETqWJC2bsz3FsQSyCQTPOuP0nLtFKBb1UKHpmYTCXG
 H4LCyCKFJ6s006qBcdaNPZBw1mrSNwoxEulHnpYA4BFfPeXi72yrnMZQkdwWONpW
 LIVuD0hBm8X/pulbvEEdjzXBqZVkqK3xFX+uX5+bnwwaUKddXAC/h9SQKpBP2Mbb
 AeZToMklKw==
 =6Glq
 -----END PGP SIGNATURE-----

Merge tag 'for-5.7/block-2020-03-29' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Online capacity resizing (Balbir)

 - Number of hardware queue change fixes (Bart)

 - null_blk fault injection addition (Bart)

 - Cleanup of queue allocation, unifying the node/no-node API
   (Christoph)

 - Cleanup of genhd, moving code to where it makes sense (Christoph)

 - Cleanup of the partition handling code (Christoph)

 - disk stat fixes/improvements (Konstantin)

 - BFQ improvements (Paolo)

 - Various fixes and improvements

* tag 'for-5.7/block-2020-03-29' of git://git.kernel.dk/linux-block: (72 commits)
  block: return NULL in blk_alloc_queue() on error
  block: move bio_map_* to blk-map.c
  Revert "blkdev: check for valid request queue before issuing flush"
  block: simplify queue allocation
  bcache: pass the make_request methods to blk_queue_make_request
  null_blk: use blk_mq_init_queue_data
  block: add a blk_mq_init_queue_data helper
  block: move the ->devnode callback to struct block_device_operations
  block: move the part_stat* helpers from genhd.h to a new header
  block: move block layer internals out of include/linux/genhd.h
  block: move guard_bio_eod to bio.c
  block: unexport get_gendisk
  block: unexport disk_map_sector_rcu
  block: unexport disk_get_part
  block: mark part_in_flight and part_in_flight_rw static
  block: mark block_depr static
  block: factor out requeue handling from dispatch code
  block/diskstats: replace time_in_queue with sum of request times
  block/diskstats: accumulate all per-cpu counters in one pass
  block/diskstats: more accurate approximation of io_ticks for slow disks
  ...
2020-03-30 11:20:13 -07:00
Christoph Hellwig 348e114bbd block: move the ->devnode callback to struct block_device_operations
There really isn't any good reason to stash a method directly into
struct gendisk.  Move it together with the other block device
operations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-27 09:50:05 -06:00
Christoph Hellwig 1b4d4dbdae block: unexport get_gendisk
get_gendisk is not used by any modular code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25 09:50:08 -06:00
Christoph Hellwig a7818aedda block: unexport disk_map_sector_rcu
disk_map_sector_rcu is not used by any modular code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25 09:50:08 -06:00
Christoph Hellwig 572e7fc85b block: unexport disk_get_part
disk_get_part is not used by any modular code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25 09:50:08 -06:00
Christoph Hellwig 6005771c17 block: mark part_in_flight and part_in_flight_rw static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25 09:50:08 -06:00
Christoph Hellwig 31eb618679 block: mark block_depr static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25 09:50:08 -06:00
Konstantin Khlebnikov 8cd5b8fc00 block/diskstats: replace time_in_queue with sum of request times
Column "time_in_queue" in diskstats is supposed to show total waiting time
of all requests. I.e. value should be equal to the sum of times from other
columns. But this is not true, because column "time_in_queue" is counted
separately in jiffies rather than in nanoseconds as other times.

This patch removes redundant counter for "time_in_queue" and shows total
time of read, write, discard and flush requests.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25 08:49:12 -06:00
Konstantin Khlebnikov ea18e0f0a6 block/diskstats: accumulate all per-cpu counters in one pass
Reading /proc/diskstats iterates over all cpus for summing each field.
It's faster to sum all fields in one pass.

Hammering /proc/diskstats with fio shows 2x performance improvement:

fio --name=test --numjobs=$JOBS --filename=/proc/diskstats \
    --size=1k --bs=1k --fallocate=none --create_on_open=1 \
    --time_based=1 --runtime=10 --invalidate=0 --group_report

	  JOBS=1	JOBS=10
Before:	  7k iops	64k iops
After:	 18k iops      120k iops

Also this way code is more compact:

add/remove: 1/0 grow/shrink: 0/2 up/down: 194/-1540 (-1346)
Function                                     old     new   delta
part_stat_read_all                             -     194    +194
diskstats_show                              1344     631    -713
part_stat_show                              1219     392    -827
Total: Before=14966947, After=14965601, chg -0.01%

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-25 08:49:10 -06:00
Christoph Hellwig 3ad5cee5cd block: move sysfs methods shared by disks and partitions to genhd.c
Move the sysfs _show methods that are used both on the full disk and
partition nodes to genhd.c instead of hiding them in the partitioning
code.  Also move the declaration for these methods to block/blk.h so
that we don't expose them to drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-24 07:57:07 -06:00
Christoph Hellwig 5cbd28e3ce block: move disk_name and related helpers out of partition-generic.c
Thes functions aren't really related to partition support, so move them
to a more suitable place.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-24 07:57:07 -06:00
Christoph Hellwig d2332c5c04 block: remove the blk_lookup_devt export
This function is only used by init/do_mounts.c, which can't be modular.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-24 07:57:07 -06:00
Balbir Singh e598a72fae block/genhd: Notify udev about capacity change
Allow block/genhd to notify user space (via udev) about disk size changes
using a new helper set_capacity_revalidate_and_notify(), which is a wrapper
on top of set_capacity(). set_capacity_revalidate_and_notify() will only
notify via udev if the current capacity or the target capacity is not zero
and iff the capacity changes.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Someswarudu Sangaraju <ssomesh@amazon.com>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-18 15:13:21 -06:00
Shin'ichiro Kawasaki b53df2e744 block: Fix partition support for host aware zoned block devices
Commit b72053072c ("block: allow partitions on host aware zone
devices") introduced the helper function disk_has_partitions() to check
if a given disk has valid partitions. However, since this function result
directly depends on the disk partition table length rather than the
actual existence of valid partitions in the table, it returns true even
after all partitions are removed from the disk. For host aware zoned
block devices, this results in zone management support to be kept
disabled even after removing all partitions.

Fix this by changing disk_has_partitions() to walk through the partition
table entries and return true if and only if a valid non-zero size
partition is found.

Fixes: b72053072c ("block: allow partitions on host aware zone devices")
Cc: stable@vger.kernel.org # 5.5
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-12 07:54:39 -06:00
Konstantin Khlebnikov b686631865 block: add iostat counters for flush requests
Requests that triggers flushing volatile writeback cache to disk (barriers)
have significant effect to overall performance.

Block layer has sophisticated engine for combining several flush requests
into one. But there is no statistics for actual flushes executed by disk.
Requests which trigger flushes usually are barriers - zero-size writes.

This patch adds two iostat counters into /sys/class/block/$dev/stat and
/proc/diskstats - count of completed flush requests and their total time.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-21 09:06:47 -07:00
Damien Le Moal 737eb78e82 block: Delay default elevator initialization
When elevator_init_mq() is called from blk_mq_init_allocated_queue(),
the only information known about the device is the number of hardware
queues as the block device scan by the device driver is not completed
yet for most drivers. The device type and elevator required features
are not set yet, preventing to correctly select the default elevator
most suitable for the device.

This currently affects all multi-queue zoned block devices which default
to the "none" elevator instead of the required "mq-deadline" elevator.
These drives currently include host-managed SMR disks connected to a
smartpqi HBA and null_blk block devices with zoned mode enabled.
Upcoming NVMe Zoned Namespace devices will also be affected.

Fix this by adding the boolean elevator_init argument to
blk_mq_init_allocated_queue() to control the execution of
elevator_init_mq(). Two cases exist:
1) elevator_init = false is used for calls to
   blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this
   case, a call to elevator_init_mq() is added to __device_add_disk(),
   resulting in the delayed initialization of the queue elevator
   after the device driver finished probing the device information. This
   effectively allows elevator_init_mq() access to more information
   about the device.
2) elevator_init = true preserves the current behavior of initializing
   the elevator directly from blk_mq_init_allocated_queue(). This case
   is used for the special request based DM devices where the device
   gendisk is created before the queue initialization and device
   information (e.g. queue limits) is already known when the queue
   initialization is executed.

Additionally, to make sure that the elevator initialization is never
done while requests are in-flight (there should be none when the device
driver calls device_add_disk()), freeze and quiesce the device request
queue before calling blk_mq_init_sched() in elevator_init_mq().

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-05 19:52:34 -06:00
Akinobu Mita 1624b0b200 block: fix sysfs module parameters directory path in comment
The runtime configurable module parameter files are located under
/sys/module/MODULENAME/parameters, not /sys/module/MODULENAME.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-16 10:16:33 -06:00
Gustavo A. R. Silva 78b90a2ce8 block: genhd: Use struct_size() helper
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes, in particular in the
context in which this code is being used.

So, replace the following form:

sizeof(*new_ptbl) + target * sizeof(new_ptbl->part[0])

with:

struct_size(new_ptbl, part, target)

Also, notice that variable size is unnecessary, hence it is removed.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 01:46:09 -06:00
Bart Van Assche 33c826ef19 block: Convert blk_invalidate_devt() header into a non-kernel-doc header
This patch avoids that the kernel-doc tool warns about this function
header when building with W=1.

Reviewed-by: Chaitanya Kulkarni <chiatanya.kulkarni@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-31 15:12:34 -06:00
Christoph Hellwig 3dcf60bcb6 block: add SPDX tags to block layer files missing licensing information
Various block layer files do not have any licensing information at all.
Add SPDX tags for the default kernel GPLv2 license to those.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 16:12:03 -06:00
Yufen Yu 6fcc44d1d7 block: fix use-after-free on gendisk
commit 2da78092dd "block: Fix dev_t minor allocation lifetime"
specifically moved blk_free_devt(dev->devt) call to part_release()
to avoid reallocating device number before the device is fully
shutdown.

However, it can cause use-after-free on gendisk in get_gendisk().
We use md device as example to show the race scenes:

Process1		Worker			Process2
md_free
						blkdev_open
del_gendisk
  add delete_partition_work_fn() to wq
  						__blkdev_get
						get_gendisk
put_disk
  disk_release
    kfree(disk)
    						find part from ext_devt_idr
						get_disk_and_module(disk)
    					  	cause use after free

    			delete_partition_work_fn
			put_device(part)
    		  	part_release
		    	remove part from ext_devt_idr

Before <devt, hd_struct pointer> is removed from ext_devt_idr by
delete_partition_work_fn(), we can find the devt and then access
gendisk by hd_struct pointer. But, if we access the gendisk after
it have been freed, it can cause in use-after-freeon gendisk in
get_gendisk().

We fix this by adding a new helper blk_invalidate_devt() in
delete_partition() and del_gendisk(). It replaces hd_struct
pointer in idr with value 'NULL', and deletes the entry from
idr in part_release() as we do now.

Thanks to Jan Kara for providing the solution and more clear comments
for the code.

Fixes: 2da78092dd ("block: Fix dev_t minor allocation lifetime")
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-22 09:48:12 -06:00
Martin Wilck cdf3e3deb7 block: check_events: don't bother with events if unsupported
Drivers now report to the block layer if they support media change
events. If this is not the case, there's no need to allocate the event
structure, and all event handling code can effectively be skipped. This
simplifies code flow in particular for non-removable sd devices.

This effectively reverts commit 75e3f3ee3c ("block: always allocate
genhd->ev if check_events is implemented").

The sysfs files for the events are kept in place even if no events are
supported, as user space may rely on them being present. The only
difference is that an error code is now returned if the user tries to
set poll_msecs.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-12 13:35:28 -06:00
Martin Wilck c92e2f04b3 block: disk_events: introduce event flags
Currently, an empty disk->events field tells the block layer not to
forward media change events to user space. This was done in commit
7c88a168da ("block: don't propagate unlisted DISK_EVENTs to userland")
in order to avoid events from "fringe" drivers to be forwarded to user
space. By doing so, the block layer lost the information which events
were supported by a particular block device, and most importantly,
whether or not a given device supports media change events at all.

Prepare for not interpreting the "events" field this way in the future
any more. This is done by adding an additional field "event_flags" to
struct gendisk, and two flag bits that can be set to have the device
treated like one that had the "events" field set to a non-zero value
before. This applies only to the sd and sr drivers, which are changed to
set the new flags.

The new flags are DISK_EVENT_FLAG_POLL to enforce polling of the device
for synchronous events, and DISK_EVENT_FLAG_UEVENT to tell the
blocklayer to generate udev events from kernel events.

In order to add the event_flags field to struct gendisk, the events
field is converted to an "unsigned short"; it doesn't need to hold
values bigger than 2 anyway.

This patch doesn't change behavior.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-12 13:35:24 -06:00
Martin Wilck 673387a930 block: genhd: remove async_events field
The async_events field, intended to be used for drivers that support
asynchronous notifications about disk events (aka media change events),
isn't currently used by any driver, and apparently that has been that
way for a long time (if not forever). Remove it.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-12 13:35:22 -06:00
Keyur Patel dfc76d11dd block: Replace function name in string with __func__
Replace hard coded function name register_blkdev with __func__, to
improve robustness and to conform to the Linux kernel coding
style. Issue found using checkpatch.

Signed-off-by: Keyur Patel <iamkeyur96@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28 14:09:08 -07:00
zhengbin 4d7c1d3fd7 block: fix NULL pointer dereference in register_disk
If __device_add_disk-->bdi_register_owner-->bdi_register-->
bdi_register_va-->device_create_vargs fails, bdi->dev is still
NULL, __device_add_disk-->register_disk will visit bdi->dev->kobj.
This patch fixes that.

Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28 14:01:36 -07:00
Mikulas Patocka e016b78201 block: return just one value from part_in_flight
The previous patches deleted all the code that needed the second value
returned from part_in_flight - now the kernel only uses the first value.

Consequently, part_in_flight (and blk_mq_in_flight) may be changed so that
it only returns one value.

This patch just refactors the code, there's no functional change.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 08:30:38 -07:00
Mikulas Patocka 1226b8dd0e block: switch to per-cpu in-flight counters
Now when part_round_stats is gone, we can switch to per-cpu in-flight
counters.

We use the local-atomic type local_t, so that if part_inc_in_flight or
part_dec_in_flight is reentrantly called from an interrupt, the value will
be correct.

The other counters could be corrupted due to reentrant interrupt, but the
corruption only results in slight counter skew - the in_flight counter
must be exact, so it needs local_t.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 08:30:37 -07:00
Mikulas Patocka 5b18b5a737 block: delete part_round_stats and switch to less precise counting
We want to convert to per-cpu in_flight counters.

The function part_round_stats needs the in_flight counter every jiffy, it
would be too costly to sum all the percpu variables every jiffy, so it
must be deleted. part_round_stats is used to calculate two counters -
time_in_queue and io_ticks.

time_in_queue can be calculated without part_round_stats, by adding the
duration of the I/O when the I/O ends (the value is almost as exact as the
previously calculated value, except that time for in-progress I/Os is not
counted).

io_ticks can be approximated by increasing the value when I/O is started
or ended and the jiffies value has changed. If the I/Os take less than a
jiffy, the value is as exact as the previously calculated value. If the
I/Os take more than a jiffy, io_ticks can drift behind the previously
calculated value.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 08:30:37 -07:00
Mike Snitzer 112f158f66 block: stop passing 'cpu' to all percpu stats methods
All of part_stat_* and related methods are used with preempt disabled,
so there is no need to pass cpu around to allow of them.  Just call
smp_processor_id() as needed.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 08:30:37 -07:00
Jens Axboe 344e9ffcbd block: add queue_is_mq() helper
Various spots check for q->mq_ops being non-NULL, but provide
a helper to do this instead.

Where the ->mq_ops != NULL check is redundant, remove it.

Since mq == rq-based now that legacy is gone, get rid of the
queue_is_rq_based() and just use queue_is_mq() everywhere.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-16 08:34:06 -07:00
Jens Axboe c0aac682fa This is the 4.19-rc6 release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAluw4MIACgkQONu9yGCS
 aT7+8xAAiYnc4khUsxeInm3z44WPfRX1+UF51frTNSY5C8Nn5nvRSnTUNLuKkkrz
 8RbwCL6UYyJxF9I/oZdHPsPOD4IxXkQY55tBjz7ZbSBIFEwYM6RJMm8mAGlXY7wq
 VyWA5MhlpGHM9DjrguB4DMRipnrSc06CVAnC+ZyKLjzblzU1Wdf2dYu+AW9pUVXP
 j4r74lFED5djPY1xfqfzEwmYRCeEGYGx7zMqT3GrrF5uFPqj1H6O5klEsAhIZvdl
 IWnJTU2coC8R/Sd17g4lHWPIeQNnMUGIUbu+PhIrZ/lDwFxlocg4BvarPXEdzgYi
 gdZzKBfovpEsSu5RCQsKWG4IGQxY7I1p70IOP9eqEFHZy77qT1YcHVAWrK1Y/bJd
 UA08gUOSzRnhKkNR3+PsaMflUOl9WkpyHECZu394cyRGMutSS50aWkavJPJ/o1Qi
 D/oGqZLLcKFyuNcchG+Met1TzY3LvYEDgSburqwqeUZWtAsGs8kmiiq7qvmXx4zV
 IcgM8ERqJ8mbfhfsXQU7hwydIrPJ3JdIq19RnM5ajbv2Q4C/qJCyAKkQoacrlKR4
 aiow/qvyNrP80rpXfPJB8/8PiWeDtAnnGhM+xySZNlw3t8GR6NYpUkIzf5TdkSb3
 C8KuKg6FY9QAS62fv+5KK3LB/wbQanxaPNruQFGe5K1iDQ5Fvzw=
 =dMl4
 -----END PGP SIGNATURE-----

Merge tag 'v4.19-rc6' into for-4.20/block

Merge -rc6 in, for two reasons:

1) Resolve a trivial conflict in the blk-mq-tag.c documentation
2) A few important regression fixes went into upstream directly, so
   they aren't in the 4.20 branch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

* tag 'v4.19-rc6': (780 commits)
  Linux 4.19-rc6
  MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
  cpufreq: qcom-kryo: Fix section annotations
  perf/core: Add sanity check to deal with pinned event failure
  xen/blkfront: correct purging of persistent grants
  Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
  selftests/powerpc: Fix Makefiles for headers_install change
  blk-mq: I/O and timer unplugs are inverted in blktrace
  dax: Fix deadlock in dax_lock_mapping_entry()
  x86/boot: Fix kexec booting failure in the SEV bit detection code
  bcache: add separate workqueue for journal_write to avoid deadlock
  drm/amd/display: Fix Edid emulation for linux
  drm/amd/display: Fix Vega10 lightup on S3 resume
  drm/amdgpu: Fix vce work queue was not cancelled when suspend
  Revert "drm/panel: Add device_link from panel device to DRM device"
  xen/blkfront: When purging persistent grants, keep them in the buffer
  clocksource/drivers/timer-atmel-pit: Properly handle error cases
  block: fix deadline elevator drain for zoned block devices
  ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
  drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
  ...

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-01 08:58:57 -06:00
Hannes Reinecke fef912bf86 block: genhd: add 'groups' argument to device_add_disk
Update device_add_disk() to take an 'groups' argument so that
individual drivers can register a device with additional sysfs
attributes.
This avoids race condition the driver would otherwise have if these
groups were to be created with sysfs_add_groups().

Signed-off-by: Martin Wilck <martin.wilck@suse.com>
Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28 08:30:28 -06:00
Omar Sandoval b57e99b4b8 block: use nanosecond resolution for iostat
Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
updating properly on 4.18. This is because we started using ktime to
track elapsed time, and we convert nanoseconds to jiffies when we update
the partition counter. However, this gets rounded down, so any I/Os that
take less than a jiffy are not accounted for. Previously in this case,
the value of jiffies would sometimes increment while we were doing I/O,
so at least some I/Os were accounted for.

Let's convert the stats to use nanoseconds internally. We still report
milliseconds as before, now more accurately than ever. The value is
still truncated to 32 bits for backwards compatibility.

Fixes: 522a777566 ("block: consolidate struct request timestamp fields")
Cc: stable@vger.kernel.org
Reported-by: Klaus Kusche <klaus.kusche@computerix.info>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:26:59 -06:00
Michael Callahan bdca3c87fb block: Track DISCARD statistics and output them in stat and diskstat
Add tracking of REQ_OP_DISCARD ios to the partition statistics and
append them to the various stat files in /sys as well as
/proc/diskstats.  These are tracked with the same four stats as reads
and writes:

Number of discard ios completed.
Number of discard ios merged
Number of discard sectors completed
Milliseconds spent on discard requests

This is done via adding a new STAT_DISCARD define to genhd.h and then
using it to index that stat field for discard requests.

tj: Refreshed on top of v4.17 and other previous updates.

Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:22 -06:00
Michael Callahan dbae2c5513 block: Define and use STAT_READ and STAT_WRITE
Add defines for STAT_READ and STAT_WRITE for indexing the partition
stat entries. This clarifies some fs/ code which has hardcoded 1 for
STAT_WRITE and will make it easier to extend the stats with additional
fields.

tj: Refreshed on top of v4.17.

Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:18 -06:00
Linus Torvalds cf626b0da7 Merge branch 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull procfs updates from Al Viro:
 "Christoph's proc_create_... cleanups series"

* 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (44 commits)
  xfs, proc: hide unused xfs procfs helpers
  isdn/gigaset: add back gigaset_procinfo assignment
  proc: update SIZEOF_PDE_INLINE_NAME for the new pde fields
  tty: replace ->proc_fops with ->proc_show
  ide: replace ->proc_fops with ->proc_show
  ide: remove ide_driver_proc_write
  isdn: replace ->proc_fops with ->proc_show
  atm: switch to proc_create_seq_private
  atm: simplify procfs code
  bluetooth: switch to proc_create_seq_data
  netfilter/x_tables: switch to proc_create_seq_private
  netfilter/xt_hashlimit: switch to proc_create_{seq,single}_data
  neigh: switch to proc_create_seq_data
  hostap: switch to proc_create_{seq,single}_data
  bonding: switch to proc_create_seq_data
  rtc/proc: switch to proc_create_single_data
  drbd: switch to proc_create_single
  resource: switch to proc_create_seq_data
  staging/rtl8192u: simplify procfs code
  jfs: simplify procfs code
  ...
2018-06-04 10:00:01 -07:00
Joe Perches 5657a819a8 block drivers/block: Use octal not symbolic permissions
Convert the S_<FOO> symbolic permissions to their octal equivalents as
using octal and not symbolic permissions is preferred by many as more
readable.

see: https://lkml.org/lkml/2016/8/2/1945

Done with automated conversion via:
$ ./scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace <files...>

Miscellanea:

o Wrapped modified multi-line calls to a single line where appropriate
o Realign modified multi-line calls to open parenthesis

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-24 13:38:59 -06:00
Christoph Hellwig fddda2b7b5 proc: introduce proc_create_seq{,_data}
Variants of proc_create{,_data} that directly take a struct seq_operations
argument and drastically reduces the boilerplate code in the callers.

All trivial callers converted over.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16 07:23:35 +02:00
Omar Sandoval bf0ddaba65 blk-mq: fix sysfs inflight counter
When the blk-mq inflight implementation was added, /proc/diskstats was
converted to use it, but /sys/block/$dev/inflight was not. Fix it by
adding another helper to count in-flight requests by data direction.

Fixes: f299b7c7a9 ("blk-mq: provide internal in-flight variant")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-26 09:02:01 -06:00
Srivatsa S. Bhat f33ff110ef block, char_dev: Use correct format specifier for unsigned ints
register_blkdev() and __register_chrdev_region() treat the major
number as an unsigned int. So print it the same way to avoid
absurd error statements such as:
"... major requested (-1) is greater than the maximum (511) ..."
(and also fix off-by-one bugs in the error prints).

While at it, also update the comment describing register_blkdev().

Signed-off-by: Srivatsa S. Bhat <srivatsa@csail.mit.edu>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-15 17:59:24 +01:00
Jan Kara 56c0908c85 genhd: Fix BUG in blkdev_open()
When two blkdev_open() calls for a partition race with device removal
and recreation, we can hit BUG_ON(!bd_may_claim(bdev, whole, holder)) in
blkdev_open(). The race can happen as follows:

CPU0				CPU1			CPU2
							del_gendisk()
							  bdev_unhash_inode(part1);

blkdev_open(part1, O_EXCL)	blkdev_open(part1, O_EXCL)
  bdev = bd_acquire()		  bdev = bd_acquire()
  blkdev_get(bdev)
    bd_start_claiming(bdev)
      - finds old inode 'whole'
      bd_prepare_to_claim() -> 0
							  bdev_unhash_inode(whole);
							<device removed>
							<new device under same
							 number created>
				  blkdev_get(bdev);
				    bd_start_claiming(bdev)
				      - finds new inode 'whole'
				      bd_prepare_to_claim()
					- this also succeeds as we have
					  different 'whole' here...
					- bad things happen now as we
					  have two exclusive openers of
					  the same bdev

The problem here is that block device opens can see various intermediate
states while gendisk is shutting down and then being recreated.

We fix the problem by introducing new lookup_sem in gendisk that
synchronizes gendisk deletion with get_gendisk() and furthermore by
making sure that get_gendisk() does not return gendisk that is being (or
has been) deleted. This makes sure that once we ever manage to look up
newly created bdev inode, we are also guaranteed that following
get_gendisk() will either return failure (and we fail open) or it
returns gendisk for the new device and following bdget_disk() will
return new bdev inode (i.e., blkdev_open() follows the path as if it is
completely run after new device is created).

Reported-and-analyzed-by: Hou Tao <houtao1@huawei.com>
Tested-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-26 09:48:42 -07:00
Jan Kara 9df6c29912 genhd: Add helper put_disk_and_module()
Add a proper counterpart to get_disk_and_module() -
put_disk_and_module(). Currently it is opencoded in several places.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-26 09:48:42 -07:00
Jan Kara 3079c22ea8 genhd: Rename get_disk() to get_disk_and_module()
Rename get_disk() to get_disk_and_module() to make sure what the
function does. It's not a great name but at least it is now clear that
put_disk() is not it's counterpart.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-26 09:48:42 -07:00
Jan Kara d52987b524 genhd: Fix leaked module reference for NVME devices
Commit 8ddcd65325 "block: introduce GENHD_FL_HIDDEN" added handling of
hidden devices to get_gendisk() but forgot to drop module reference
which is also acquired by get_disk(). Drop the reference as necessary.

Arguably the function naming here is misleading as put_disk() is *not*
the counterpart of get_disk() but let's fix that in the follow up
commit since that will be more intrusive.

Fixes: 8ddcd65325
CC: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-26 09:48:42 -07:00
Mike Snitzer fa70d2e2c4 block: allow gendisk's request_queue registration to be deferred
Since I can remember DM has forced the block layer to allow the
allocation and initialization of the request_queue to be distinct
operations.  Reason for this is block/genhd.c:add_disk() has requires
that the request_queue (and associated bdi) be tied to the gendisk
before add_disk() is called -- because add_disk() also deals with
exposing the request_queue via blk_register_queue().

DM's dynamic creation of arbitrary device types (and associated
request_queue types) requires the DM device's gendisk be available so
that DM table loads can establish a master/slave relationship with
subordinate devices that are referenced by loaded DM tables -- using
bd_link_disk_holder().  But until these DM tables, and their associated
subordinate devices, are known DM cannot know what type of request_queue
it needs -- nor what its queue_limits should be.

This chicken and egg scenario has created all manner of problems for DM
and, at times, the block layer.

Summary of changes:

- Add device_add_disk_no_queue_reg() and add_disk_no_queue_reg() variant
  that drivers may use to add a disk without also calling
  blk_register_queue().  Driver must call blk_register_queue() once its
  request_queue is fully initialized.

- Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED
  is not set.  It won't be set if driver used add_disk_no_queue_reg()
  but driver encounters an error and must del_gendisk() before calling
  blk_register_queue().

- Export blk_register_queue().

These changes allow DM to use add_disk_no_queue_reg() to anchor its
gendisk as the "master" for master/slave relationships DM must establish
with subordinate devices referenced in DM tables that get loaded.  Once
all "slave" devices for a DM device are known its request_queue can be
properly initialized and then advertised via sysfs -- important
improvement being that no request_queue resource initialization
performed by blk_register_queue() is missed for DM devices anymore.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-15 08:41:38 -07:00
Mike Snitzer bc8d062c36 block: only bdi_unregister() in del_gendisk() if !GENHD_FL_HIDDEN
device_add_disk() will only call bdi_register_owner() if
!GENHD_FL_HIDDEN, so it follows that del_gendisk() should only call
bdi_unregister() if !GENHD_FL_HIDDEN.

Found with code inspection.  bdi_unregister() won't do any harm if
bdi_register_owner() wasn't used but best to avoid the unnecessary
call to bdi_unregister().

Fixes: 8ddcd65325 ("block: introduce GENHD_FL_HIDDEN")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-15 08:41:38 -07:00
Randy Dunlap 7fb526212f block: genhd.c: fix message typo
Fix typo in error message.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-19 11:02:19 -07:00
weiping zhang 3a92168bc8 block: add WARN_ON if bdi register fail
device_add_disk need do more safety error handle, so this patch just
add WARN_ON.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>

Adapted for current series by me.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-19 11:02:13 -07:00
Linus Torvalds e2c5923c34 Merge branch 'for-4.15/block' of git://git.kernel.dk/linux-block
Pull core block layer updates from Jens Axboe:
 "This is the main pull request for block storage for 4.15-rc1.

  Nothing out of the ordinary in here, and no API changes or anything
  like that. Just various new features for drivers, core changes, etc.
  In particular, this pull request contains:

   - A patch series from Bart, closing the whole on blk/scsi-mq queue
     quescing.

   - A series from Christoph, building towards hidden gendisks (for
     multipath) and ability to move bio chains around.

   - NVMe
        - Support for native multipath for NVMe (Christoph).
        - Userspace notifications for AENs (Keith).
        - Command side-effects support (Keith).
        - SGL support (Chaitanya Kulkarni)
        - FC fixes and improvements (James Smart)
        - Lots of fixes and tweaks (Various)

   - bcache
        - New maintainer (Michael Lyle)
        - Writeback control improvements (Michael)
        - Various fixes (Coly, Elena, Eric, Liang, et al)

   - lightnvm updates, mostly centered around the pblk interface
     (Javier, Hans, and Rakesh).

   - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

   - Writeback series that fix the much discussed hundreds of millions
     of sync-all units. This goes all the way, as discussed previously
     (me).

   - Fix for missing wakeup on writeback timer adjustments (Yafang
     Shao).

   - Fix laptop mode on blk-mq (me).

   - {mq,name} tupple lookup for IO schedulers, allowing us to have
     alias names. This means you can use 'deadline' on both !mq and on
     mq (where it's called mq-deadline). (me).

   - blktrace race fix, oopsing on sg load (me).

   - blk-mq optimizations (me).

   - Obscure waitqueue race fix for kyber (Omar).

   - NBD fixes (Josef).

   - Disable writeback throttling by default on bfq, like we do on cfq
     (Luca Miccio).

   - Series from Ming that enable us to treat flush requests on blk-mq
     like any other request. This is a really nice cleanup.

   - Series from Ming that improves merging on blk-mq with schedulers,
     getting us closer to flipping the switch on scsi-mq again.

   - BFQ updates (Paolo).

   - blk-mq atomic flags memory ordering fixes (Peter Z).

   - Loop cgroup support (Shaohua).

   - Lots of minor fixes from lots of different folks, both for core and
     driver code"

* 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
  nvme: fix visibility of "uuid" ns attribute
  blk-mq: fixup some comment typos and lengths
  ide: ide-atapi: fix compile error with defining macro DEBUG
  blk-mq: improve tag waiting setup for non-shared tags
  brd: remove unused brd_mutex
  blk-mq: only run the hardware queue if IO is pending
  block: avoid null pointer dereference on null disk
  fs: guard_bio_eod() needs to consider partitions
  xtensa/simdisk: fix compile error
  nvme: expose subsys attribute to sysfs
  nvme: create 'slaves' and 'holders' entries for hidden controllers
  block: create 'slaves' and 'holders' entries for hidden gendisks
  nvme: also expose the namespace identification sysfs files for mpath nodes
  nvme: implement multipath access to nvme subsystems
  nvme: track shared namespaces
  nvme: introduce a nvme_ns_ids structure
  nvme: track subsystems
  block, nvme: Introduce blk_mq_req_flags_t
  block, scsi: Make SCSI quiesce and resume work reliably
  block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
  ...
2017-11-14 15:32:19 -08:00
Colin Ian King f0fba398fe block: avoid null pointer dereference on null disk
It is possible that the pointer disk can be null and hence
we can get a null pointer deference when accessing disk->flags.
Add a null pointer check to avoid the dereference.

Detected by CoverityScan, CID#1461133 ("Explicit null dereferenced")

Fixes: 8ddcd65325 ("block: introduce GENHD_FL_HIDDEN")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:55:57 -07:00
Hannes Reinecke 17eac09963 block: create 'slaves' and 'holders' entries for hidden gendisks
When creating nvme multipath devices we should populate the 'slaves' and
'holders' directorys properly to aid userspace topology detection.

Signed-off-by: Hannes Reinecke <hare@suse.com>
[hch: split from a larger patch]
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Christoph Hellwig 8ddcd65325 block: introduce GENHD_FL_HIDDEN
With this flag a driver can create a gendisk that can be used for I/O
submission inside the kernel, but which is not registered as user
facing block device.  This will be useful for the NVMe multipath
implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 10:31:48 -06:00
Christoph Hellwig 517bf3c306 block: don't look at the struct device dev_t in disk_devt
The hidden gendisks introduced in the next patch need to keep the dev
field in their struct device empty so that udev won't try to create
block device nodes for them.  To support that rewrite disk_devt to
look at the major and first_minor fields in the gendisk itself instead
of looking into the struct device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 10:31:48 -06:00
Byungchul Park e319e1fbd9 block, locking/lockdep: Assign a lock_class per gendisk used for wait_for_completion()
Darrick posted the following warning and Dave Chinner analyzed it:

> ======================================================
> WARNING: possible circular locking dependency detected
> 4.14.0-rc1-fixes #1 Tainted: G        W
> ------------------------------------------------------
> loop0/31693 is trying to acquire lock:
>  (&(&ip->i_mmaplock)->mr_lock){++++}, at: [<ffffffffa00f1b0c>] xfs_ilock+0x23c/0x330 [xfs]
>
> but now in release context of a crosslock acquired at the following:
>  ((complete)&ret.event){+.+.}, at: [<ffffffff81326c1f>] submit_bio_wait+0x7f/0xb0
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 ((complete)&ret.event){+.+.}:
>        lock_acquire+0xab/0x200
>        wait_for_completion_io+0x4e/0x1a0
>        submit_bio_wait+0x7f/0xb0
>        blkdev_issue_zeroout+0x71/0xa0
>        xfs_bmapi_convert_unwritten+0x11f/0x1d0 [xfs]
>        xfs_bmapi_write+0x374/0x11f0 [xfs]
>        xfs_iomap_write_direct+0x2ac/0x430 [xfs]
>        xfs_file_iomap_begin+0x20d/0xd50 [xfs]
>        iomap_apply+0x43/0xe0
>        dax_iomap_rw+0x89/0xf0
>        xfs_file_dax_write+0xcc/0x220 [xfs]
>        xfs_file_write_iter+0xf0/0x130 [xfs]
>        __vfs_write+0xd9/0x150
>        vfs_write+0xc8/0x1c0
>        SyS_write+0x45/0xa0
>        entry_SYSCALL_64_fastpath+0x1f/0xbe
>
> -> #1 (&xfs_nondir_ilock_class){++++}:
>        lock_acquire+0xab/0x200
>        down_write_nested+0x4a/0xb0
>        xfs_ilock+0x263/0x330 [xfs]
>        xfs_setattr_size+0x152/0x370 [xfs]
>        xfs_vn_setattr+0x6b/0x90 [xfs]
>        notify_change+0x27d/0x3f0
>        do_truncate+0x5b/0x90
>        path_openat+0x237/0xa90
>        do_filp_open+0x8a/0xf0
>        do_sys_open+0x11c/0x1f0
>        entry_SYSCALL_64_fastpath+0x1f/0xbe
>
> -> #0 (&(&ip->i_mmaplock)->mr_lock){++++}:
>        up_write+0x1c/0x40
>        xfs_iunlock+0x1d0/0x310 [xfs]
>        xfs_file_fallocate+0x8a/0x310 [xfs]
>        loop_queue_work+0xb7/0x8d0
>        kthread_worker_fn+0xb9/0x1f0
>
> Chain exists of:
>   &(&ip->i_mmaplock)->mr_lock --> &xfs_nondir_ilock_class --> (complete)&ret.event
>
>  Possible unsafe locking scenario by crosslock:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&xfs_nondir_ilock_class);
>   lock((complete)&ret.event);
>                                lock(&(&ip->i_mmaplock)->mr_lock);
>                                unlock((complete)&ret.event);
>
>                *** DEADLOCK ***

The warning is a false positive, caused by the fact that all
wait_for_completion()s in submit_bio_wait() are waiting with the same
lock class.

However, some bios have nothing to do with others, for example in the case
of loop devices, there's no direct connection between the bios of an upper
device and the bios of a lower device(=loop device).

The safest way to assign different lock classes to different devices is
to do it for each gendisk. In other words, this patch assigns a
lockdep_map per gendisk and uses it when initializing completion in
submit_bio_wait().

Analyzed-by: Dave Chinner <david@fromorbit.com>
Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: amir73il@gmail.com
Cc: axboe@kernel.dk
Cc: david@fromorbit.com
Cc: hch@infradead.org
Cc: idryomov@gmail.com
Cc: johan@kernel.org
Cc: johannes.berg@intel.com
Cc: kernel-team@lge.com
Cc: linux-block@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Cc: oleg@redhat.com
Cc: tj@kernel.org
Link: http://lkml.kernel.org/r/1508921765-15396-10-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-26 07:54:17 +02:00
Linus Torvalds a0725ab0c7 Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the first pull request for 4.14, containing most of the code
  changes. It's a quiet series this round, which I think we needed after
  the churn of the last few series. This contains:

   - Fix for a registration race in loop, from Anton Volkov.

   - Overflow complaint fix from Arnd for DAC960.

   - Series of drbd changes from the usual suspects.

   - Conversion of the stec/skd driver to blk-mq. From Bart.

   - A few BFQ improvements/fixes from Paolo.

   - CFQ improvement from Ritesh, allowing idling for group idle.

   - A few fixes found by Dan's smatch, courtesy of Dan.

   - A warning fixup for a race between changing the IO scheduler and
     device remova. From David Jeffery.

   - A few nbd fixes from Josef.

   - Support for cgroup info in blktrace, from Shaohua.

   - Also from Shaohua, new features in the null_blk driver to allow it
     to actually hold data, among other things.

   - Various corner cases and error handling fixes from Weiping Zhang.

   - Improvements to the IO stats tracking for blk-mq from me. Can
     drastically improve performance for fast devices and/or big
     machines.

   - Series from Christoph removing bi_bdev as being needed for IO
     submission, in preparation for nvme multipathing code.

   - Series from Bart, including various cleanups and fixes for switch
     fall through case complaints"

* 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
  kernfs: checking for IS_ERR() instead of NULL
  drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
  drbd: Fix allyesconfig build, fix recent commit
  drbd: switch from kmalloc() to kmalloc_array()
  drbd: abort drbd_start_resync if there is no connection
  drbd: move global variables to drbd namespace and make some static
  drbd: rename "usermode_helper" to "drbd_usermode_helper"
  drbd: fix race between handshake and admin disconnect/down
  drbd: fix potential deadlock when trying to detach during handshake
  drbd: A single dot should be put into a sequence.
  drbd: fix rmmod cleanup, remove _all_ debugfs entries
  drbd: Use setup_timer() instead of init_timer() to simplify the code.
  drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
  drbd: new disk-option disable-write-same
  drbd: Fix resource role for newly created resources in events2
  drbd: mark symbols static where possible
  drbd: Send P_NEG_ACK upon write error in protocol != C
  drbd: add explicit plugging when submitting batches
  drbd: change list_for_each_safe to while(list_first_entry_or_null)
  drbd: introduce drbd_recv_header_maybe_unplug
  ...
2017-09-07 11:59:42 -07:00
Christoph Hellwig 807d4af2f6 block: add a __disk_get_part helper
This helper allows looking up a partion under RCU protection without
grabbing a reference to it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:52 -06:00