Commit Graph

300 Commits

Author SHA1 Message Date
Jens Axboe da521626ac bio: optimize initialization of a bio
The memset() used is measurably slower in targeted benchmarks, wasting
about 1% of the total runtime, or 50% of the (later) hot path cached
bio alloc. Get rid of it and fill in the bio manually.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:44:02 -06:00
Edward Hsieh 60b6a7e6a0 block: fix trace completion for chained bio
For chained bio, trace_block_bio_complete in bio_endio is currently called
only by the parent bio once upon all chained bio completed.
However, the sector and size for the parent bio are modified in bio_split.
Therefore, the size and sector of the complete events might not match the
queue events in blktrace.

The original fix of bio completion trace <fbbaf700e7b1> ("block: trace
completion of all bios.") wants multiple complete events to correspond
to one queue event but missed this.

The issue can be reproduced by md/raid5 read with bio cross chunks.

To fix, move trace completion into the loop for every chained bio to call.

Fixes: fbbaf700e7 ("block: trace completion of all bios.")
Reviewed-by: Wade Liang <wadel@synology.com>
Reviewed-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: Edward Hsieh <edwardh@synology.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210624123030.27014-1-edwardh@synology.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-24 09:53:50 -06:00
Jens Axboe 35c820e715 Revert "bio: limit bio max size"
This reverts commit cd2c7545ae.

Alex reports that the commit causes corruption with LUKS on ext4. Revert
it for now so that this can be investigated properly.

Link: https://lore.kernel.org/linux-block/1620493841.bxdq8r5haw.none@localhost/
Reported-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-08 21:49:48 -06:00
Changheun Lee cd2c7545ae bio: limit bio max size
bio size can grow up to 4GB when muli-page bvec is enabled.
but sometimes it would lead to inefficient behaviors.
in case of large chunk direct I/O, - 32MB chunk read in user space -
all pages for 32MB would be merged to a bio structure if the pages
physical addresses are contiguous. it makes some delay to submit
until merge complete. bio max size should be limited to a proper size.

When 32MB chunk read with direct I/O option is coming from userspace,
kernel behavior is below now in do_direct_IO() loop. it's timeline.

 | bio merge for 32MB. total 8,192 pages are merged.
 | total elapsed time is over 2ms.
 |------------------ ... ----------------------->|
                                                 | 8,192 pages merged a bio.
                                                 | at this time, first bio submit is done.
                                                 | 1 bio is split to 32 read request and issue.
                                                 |--------------->
                                                  |--------------->
                                                   |--------------->
                                                              ......
                                                                   |--------------->
                                                                    |--------------->|
                          total 19ms elapsed to complete 32MB read done from device. |

If bio max size is limited with 1MB, behavior is changed below.

 | bio merge for 1MB. 256 pages are merged for each bio.
 | total 32 bio will be made.
 | total elapsed time is over 2ms. it's same.
 | but, first bio submit timing is fast. about 100us.
 |--->|--->|--->|---> ... -->|--->|--->|--->|--->|
      | 256 pages merged a bio.
      | at this time, first bio submit is done.
      | and 1 read request is issued for 1 bio.
      |--------------->
           |--------------->
                |--------------->
                                      ......
                                                 |--------------->
                                                  |--------------->|
        total 17ms elapsed to complete 32MB read done from device. |

As a result, read request issue timing is faster if bio max size is limited.
Current kernel behavior with multipage bvec, super large bio can be created.
And it lead to delay first I/O request issue.

Signed-off-by: Changheun Lee <nanich.lee@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210503095203.29076-1-nanich.lee@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-03 11:00:11 -06:00
Linus Torvalds 6c00292113 for-5.13/block-2021-04-27
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCIJW0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpr8sD/4qP+MsFTB1IFUu8fW7BjBPdduoK8Vq9o3S
 HB8iF/yhJZ73nLecMMdn/jTO8SCW0Iw+okywW3BugGnNPbwXo0UQ4jLhzbTts76P
 JvZaguZFhBsF3ceFOt3CRCQDOeoDfMp3sitLUVivkN+2vwMs9vJpVNaEeUjcCC1Z
 8QjlpqYSMuakTwEn7QhlnKxVWn1V2B6PDjZMcf48ONRZGsCkoOXH1SE4Ge8nxjqa
 KHKO5bvwgRzGhKpvdHEIl8dmFL9WEWElBVoY3vE2EHL0SPE32zHlxtYLS0NAhY2M
 aprkJ0QP0Rgl8HpYiCstwAnJGKDg4a0ArWhf/CJTuLAWmTNFR7v5n7vw2SilJHTG
 0FtiFiOnpvvBmUC0B1PUEQX8AiFcdXueLb6xboExcp2WtxIAe8wPoGFl6T1tobBY
 qsfWggGs/vD1RVrJISPC+20cJemcRyeakMV48w+n3Lt/ES3IEv/LXx6PO/PbXvOo
 B7HJXTofkoaX52A/1+NxraGapwzhYouhi6Sb6Fc++X59/a/oBuOUGuur0eZ+/oWA
 9787mUUDmW/sahfZUgZh5AxqKo2jJULjeggANCICW9/RN6duV8TBQVOLW1/0Wddp
 9lndiA9ZMveWF+J19+sjBoiYMYawLmURaOlDK77ctTCcR/ji3l4GZ+2KvBEMeIT8
 O1OYEnwaIQ==
 =oza6
 -----END PGP SIGNATURE-----

Merge tag 'for-5.13/block-2021-04-27' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "Pretty quiet round this time, which is nice. In detail:

   - Series revamping bounce buffer support (Christoph)

   - Dead code removal (Christoph, Bart)

   - Partition iteration revamp, now using xarray (Christoph)

   - Passthrough request scheduler improvements (Lin)

   - Series of BFQ improvements (Paolo)

   - Fix ioprio task iteration (Peter)

   - Various little tweaks and fixes (Tejun, Saravanan, Bhaskar, Max,
     Nikolay)"

* tag 'for-5.13/block-2021-04-27' of git://git.kernel.dk/linux-block: (41 commits)
  blk-iocost: don't ignore vrate_min on QD contention
  blk-mq: Fix spurious debugfs directory creation during initialization
  bfq/mq-deadline: remove redundant check for passthrough request
  blk-mq: bypass IO scheduler's limit_depth for passthrough request
  block: Remove an obsolete comment from sg_io()
  block: move bio_list_copy_data to pktcdvd
  block: remove zero_fill_bio_iter
  block: add queue_to_disk() to get gendisk from request_queue
  block: remove an incorrect check from blk_rq_append_bio
  block: initialize ret in bdev_disk_changed
  block: Fix sys_ioprio_set(.which=IOPRIO_WHO_PGRP) task iteration
  block: remove disk_part_iter
  block: simplify diskstats_show
  block: simplify show_partition
  block: simplify printk_all_partitions
  block: simplify partition_overlaps
  block: simplify partition removal
  block: take bd_mutex around delete_partitions in del_gendisk
  block: refactor blk_drop_partitions
  block: move more syncing and invalidation to delete_partition
  ...
2021-04-28 14:27:12 -07:00
Christoph Hellwig 5f03414d40 block: move bio_list_copy_data to pktcdvd
bio_list_copy_data is only used by pktcdvd, so move it there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210412134658.2623190-2-hch@lst.de
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12 09:19:58 -06:00
Christoph Hellwig 6f822e1b5d block: remove zero_fill_bio_iter
zero_fill_bio_iter is only used to implement zero_fill_bio, so
remove the indirection.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210412134658.2623190-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12 09:19:43 -06:00
Yufen Yu 3edf5346e4 block: only update parent bi_status when bio fail
For multiple split bios, if one of the bio is fail, the whole
should return error to application. But we found there is a race
between bio_integrity_verify_fn and bio complete, which return
io success to application after one of the bio fail. The race as
following:

split bio(READ)          kworker

nvme_complete_rq
blk_update_request //split error=0
  bio_endio
    bio_integrity_endio
      queue_work(kintegrityd_wq, &bip->bip_work);

                         bio_integrity_verify_fn
                         bio_endio //split bio
                          __bio_chain_endio
                             if (!parent->bi_status)

                               <interrupt entry>
                               nvme_irq
                                 blk_update_request //parent error=7
                                 req_bio_endio
                                    bio->bi_status = 7 //parent bio
                               <interrupt exit>

                               parent->bi_status = 0
                        parent->bi_end_io() // return bi_status=0

The bio has been split as two: split and parent. When split
bio completed, it depends on kworker to do endio, while
bio_integrity_verify_fn have been interrupted by parent bio
complete irq handler. Then, parent bio->bi_status which have
been set in irq handler will overwrite by kworker.

In fact, even without the above race, we also need to conside
the concurrency beteen mulitple split bio complete and update
the same parent bi_status. Normally, multiple split bios will
be issued to the same hctx and complete from the same irq
vector. But if we have updated queue map between multiple split
bios, these bios may complete on different hw queue and different
irq vector. Then the concurrency update parent bi_status may
cause the final status error.

Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210331115359.1125679-1-yuyufen@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-31 19:18:04 -06:00
Johannes Thumshirn 7de55b7d6f block: support zone append bvecs
Christoph reported that we'll likely trigger the WARN_ON_ONCE() checking
that we're not submitting a bvec with REQ_OP_ZONE_APPEND in
bio_iov_iter_get_pages() some time ago using zoned btrfs, but I couldn't
reproduce it back then.

Now Naohiro was able to trigger the bug as well with xfstests generic/095
on a zoned btrfs.

There is nothing that prevents bvec submissions via REQ_OP_ZONE_APPEND if
the hardware's zone append limit is met.

Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/10bd414d9326c90cd69029077db63b363854eee5.1616600835.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-24 11:36:51 -06:00
Christoph Hellwig a8affc03a9 block: rename BIO_MAX_PAGES to BIO_MAX_VECS
Ever since the addition of multipage bio_vecs BIO_MAX_PAGES has been
horribly confusingly misnamed.  Rename it to BIO_MAX_VECS to stop
confusing users of the bio API.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210311110137.1132391-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-11 07:47:48 -07:00
Linus Torvalds 582cd91f69 for-5.12/block-2021-02-17
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAtmIwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplzLEAC5O+3rBM8QuiJdo39Yppmuw4hDJ6hOKynP
 EJQLKQQi0VfXgU+MprGvcbpFYmNbgICvUICQkEzJuk++kPCu/BJtJz0yErQeLgS+
 RdXiPV6enbF7iRML5TVRTr1q/z7sJMXcIIJ8Pz/rU/JNfGYExVd0WfnEY9mp1jOt
 Bl9V+qyTazdP+Ma4+uEPatSayqcdi1rxB5I+7v/sLiOvKZZWkaRZjUZ/mxAjUfvK
 dBOOPjMygEo3tCLkIyyA6lpLvr1r+SUZhLuebRLEKa3To3TW6RtoG0qwpKmI2iKw
 ylLeVLB60nM9RUxjflVOfBsHxz1bDg5Ve86y5nCjQd4Jo8x1c4DnecyGE5/Tu8Rg
 rgbsfD6nFWzhDCvcZT0XrfQ4ZAjIL2IfT+ypQiQ6UlRd3hvIKRmzWMkjuH2svr0u
 ey9Kq+lYerI4cM0F3W73gzUKdIQOuCzBCYxQuSQQomscBa7FCInyU192dAI9Aj6l
 Yd06mgKu6qCx6zLv6JfpBqaBHZMwyGE4dmZgPQFuuwO+b4N+Ck3Jm5fzEzw/xIxQ
 wdo/DlsAl60BXentB6FByGBJaCjVdSymRqN/xNCAbFKCjmr6TLBuXPfg1gYYO7xC
 VOcVjWe8iN3wWHZab3t2mxMKH9B9B/KKzIhu6TNHSmgtQ5paZPRCBx995pDyRw26
 WC22RGC2MA==
 =os1E
 -----END PGP SIGNATURE-----

Merge tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:
 "Another nice round of removing more code than what is added, mostly
  due to Christoph's relentless pursuit of tech debt removal/cleanups.
  This pull request contains:

   - Two series of BFQ improvements (Paolo, Jan, Jia)

   - Block iov_iter improvements (Pavel)

   - bsg error path fix (Pan)

   - blk-mq scheduler improvements (Jan)

   - -EBUSY discard fix (Jan)

   - bvec allocation improvements (Ming, Christoph)

   - bio allocation and init improvements (Christoph)

   - Store bdev pointer in bio instead of gendisk + partno (Christoph)

   - Block trace point cleanups (Christoph)

   - hard read-only vs read-only split (Christoph)

   - Block based swap cleanups (Christoph)

   - Zoned write granularity support (Damien)

   - Various fixes/tweaks (Chunguang, Guoqing, Lei, Lukas, Huhai)"

* tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block: (104 commits)
  mm: simplify swapdev_block
  sd_zbc: clear zone resources for non-zoned case
  block: introduce blk_queue_clear_zone_settings()
  zonefs: use zone write granularity as block size
  block: introduce zone_write_granularity limit
  block: use blk_queue_set_zoned in add_partition()
  nullb: use blk_queue_set_zoned() to setup zoned devices
  nvme: cleanup zone information initialization
  block: document zone_append_max_bytes attribute
  block: use bi_max_vecs to find the bvec pool
  md/raid10: remove dead code in reshape_request
  block: mark the bio as cloned in bio_iov_bvec_set
  block: set BIO_NO_PAGE_REF in bio_iov_bvec_set
  block: remove a layer of indentation in bio_iov_iter_get_pages
  block: turn the nr_iovecs argument to bio_alloc* into an unsigned short
  block: remove the 1 and 4 vec bvec_slabs entries
  block: streamline bvec_alloc
  block: factor out a bvec_alloc_gfp helper
  block: move struct biovec_slab to bio.c
  block: reuse BIO_INLINE_VECS for integrity bvecs
  ...
2021-02-21 11:02:48 -08:00
Johannes Thumshirn ae29333fa6 block: add bio_add_zone_append_page
Add bio_add_zone_append_page(), a wrapper around bio_add_hw_page() which
is intended to be used by file systems that directly add pages to a bio
instead of using bio_iov_iter_get_pages().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 00:52:19 +01:00
Christoph Hellwig 7a800a20ae block: use bi_max_vecs to find the bvec pool
Instead of encoding of the bvec pool using magic bio flags, just use
a helper to find the pool based on the max_vecs value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-08 08:33:16 -07:00
Christoph Hellwig 977be01273 block: mark the bio as cloned in bio_iov_bvec_set
bio_iov_bvec_set clones the bio_vecs from the iter, and thus should be
treated like a cloned bio in every respect.  That also includes not
touching bi_max_vecs as that is a property of the bio allocation and not
its current payload.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-08 08:33:16 -07:00
Christoph Hellwig ed97ce5e1d block: set BIO_NO_PAGE_REF in bio_iov_bvec_set
bio_iov_bvec_set assigns the foreign bvec, so setting the NO_PAGE_REF
directly there seems like the best fit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-08 08:33:16 -07:00
Christoph Hellwig 86004515ed block: remove a layer of indentation in bio_iov_iter_get_pages
Remove a pointless layer of indentation after a return statement.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-08 08:33:16 -07:00
Christoph Hellwig 0f2e6ab851 block: turn the nr_iovecs argument to bio_alloc* into an unsigned short
The bi_max_vecs and bi_vcnt fields are defined as unsigned short, so
don't allow passing larger values in.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-08 08:33:16 -07:00
Christoph Hellwig de76fd8930 block: remove the 1 and 4 vec bvec_slabs entries
All bios with up to 4 bvecs use the inline bvecs in the bio itself, so
don't bother to define bvec_slabs entries for them.  Also decruftify
the bvec_slabs definition and initialization while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-08 08:33:16 -07:00
Christoph Hellwig f007a3d66c block: streamline bvec_alloc
Avoid the pointless goto by trying the slab allocation first and falling
through to the mempool.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-08 08:33:16 -07:00
Christoph Hellwig f2c3eb9bb0 block: factor out a bvec_alloc_gfp helper
Clean up bvec_alloc a little by factoring out a helper for the gfp_t
manipulations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-08 08:33:16 -07:00
Christoph Hellwig 6ac0b71537 block: move struct biovec_slab to bio.c
struct biovec_slab is only used inside of bio.c, so move it there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-08 08:33:16 -07:00
Christoph Hellwig dc0b8a57ad block: reuse BIO_INLINE_VECS for integrity bvecs
bvec_alloc always uses biovec_slabs, and thus always needs to use the
same number of inline vecs.  Share a single definition for the data
and integrity bvecs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-08 08:33:15 -07:00
Ming Lei 8358c28a5d block: fix memory leak of bvec
bio_init() clears bio instance, so the bvec index has to be set after
bio_init(), otherwise bio->bi_io_vec may be leaked.

Fixes: 3175199ab0 ("block: split bio_kmalloc from bio_alloc_bioset")
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-02 08:57:56 -07:00
Christoph Hellwig 3175199ab0 block: split bio_kmalloc from bio_alloc_bioset
bio_kmalloc shares almost no logic with the bio_set based fast path
in bio_alloc_bioset.  Split it into an entirely separate implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-27 09:51:48 -07:00
Christoph Hellwig 46bbf653a6 block: inherit BIO_REMAPPED when cloning bios
Cloned bios are can be used to on the same device, in which case we need
to inherit the BIO_REMAPPED flag to avoid a double partition remap.  When
the cloned bios are used on another device, bio_set_dev will clear the flag.

Fixes: 309dca309f ("block: store a block_device pointer in struct bio")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-26 08:50:01 -07:00
Pavel Begunkov c42bca92be bio: don't copy bvec for direct IO
The block layer spends quite a while in blkdev_direct_IO() to copy and
initialise bio's bvec. However, if we've already got a bvec in the input
iterator it might be reused in some cases, i.e. when new
ITER_BVEC_FLAG_FIXED flag is set. Simple tests show considerable
performance boost, and it also reduces memory footprint.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25 08:58:24 -07:00
Pavel Begunkov 0cf41e5e9b block/psi: remove PSI annotations from direct IO
Direct IO does not operate on the current working set of pages managed
by the kernel, so it should not be accounted as memory stall to PSI
infrastructure.

The block layer and iomap direct IO use bio_iov_iter_get_pages()
to build bios, and they are the only users of it, so to avoid PSI
tracking for them clear out BIO_WORKINGSET flag. Do same for
dio_bio_submit() because fs/direct_io constructs bios by hand directly
calling bio_add_page().

Reported-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25 08:58:24 -07:00
Ming Lei baa2c7c971 block: set .bi_max_vecs as actual allocated vector number
bvec_alloc() may allocate more bio vectors than requested, so set
.bi_max_vecs as actual allocated vector number, instead of the requested
number. This way can help fs build bigger bio because new bio often won't
be allocated until the current one becomes full.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24 21:22:45 -07:00
Ming Lei 9f180e315a block: don't allocate inline bvecs if this bioset needn't bvecs
The inline bvecs won't be used if user needn't bvecs by not passing
BIOSET_NEED_BVECS, so don't allocate bvecs in this situation.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24 21:22:45 -07:00
Ming Lei 49d1ec8573 block: manage bio slab cache by xarray
Managing bio slab cache via xarray by using slab cache size as xarray
index, and storing 'struct bio_slab' instance into xarray.

So code is simplified a lot, meantime it becomes more readable than before.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24 21:22:45 -07:00
Christoph Hellwig 309dca309f block: store a block_device pointer in struct bio
Replace the gendisk pointer in struct bio with a pointer to the newly
improved struct block device.  From that the gendisk can be trivially
accessed with an extra indirection, but it also allows to directly
look up all information related to partition remapping.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24 18:17:20 -07:00
Pavel Begunkov 22b56c2964 bio: optimise bvec iteration
__bio_for_each_bvec(), __bio_for_each_segment() and bio_copy_data_iter()
fall under conditions of bvec_iter_advance_single(), which is a faster
and slimmer version of bvec_iter_advance(). Add
bio_advance_iter_single() and convert them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-02 09:46:55 -07:00
Christoph Hellwig 8446fe9255 block: switch partition lookup to use struct block_device
Use struct block_device to lookup partitions on a disk.  This removes
all usage of struct hd_struct from the I/O path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Coly Li <colyli@suse.de>			[bcache]
Acked-by: Chao Yu <yuchao0@huawei.com>			[f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig a782483cc1 block: remove the nr_sects field in struct hd_struct
Now that the hd_struct always has a block device attached to it, there is
no need for having two size field that just get out of sync.

Additionally the field in hd_struct did not use proper serialization,
possibly allowing for torn writes.  By only using the block_device field
this problem also gets fixed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Coly Li <colyli@suse.de>			[bcache]
Acked-by: Chao Yu <yuchao0@huawei.com>			[f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Naohiro Aota 4977d121bc block: advance iov_iter on bio_add_hw_page failure
When the bio's size reaches max_append_sectors, bio_add_hw_page returns
0 then __bio_iov_append_get_pages returns -EINVAL. This is an expected
result of building a small enough bio not to be split in the IO path.
However, iov_iter is not advanced in this case, causing the same pages
are filled for the bio again and again.

Fix the case by properly advancing the iov_iter for already processed
pages.

Fixes: 0512a75b98 ("block: Introduce REQ_OP_ZONE_APPEND")
Cc: stable@vger.kernel.org # 5.8+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-28 07:51:02 -06:00
Mauro Carvalho Chehab 5cd3ddc186 docs: bio: fix a kerneldoc markup
Fix this warning:

	./block/bio.c:1098: WARNING: Inline emphasis start-string without end-string.

The thing is that *iter is not a valid markup.

That seems to be a typo:
	*iter -> @iter

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
2020-10-15 07:49:48 +02:00
Mauro Carvalho Chehab 5b874af627 block: bio: fix a warning at the kernel-doc markups
Using "@bio's parent" causes the following waring:
	./block/bio.c:10: WARNING: Inline emphasis start-string without end-string.

The main problem here is that this would be converted into:

	**bio**'s parent

By kernel-doc, which is not a valid notation. It would be
possible to use, instead, this kernel-doc markup:

	``bio's`` parent

Yet, here, is probably simpler to just use an altenative language:

	the parent of @bio

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
2020-10-15 07:49:47 +02:00
Linus Torvalds 3ad11d7ac8 block-5.10-2020-10-12
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EWUgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnoxEADCVSNBRkpV0OVkOEC3wf8EGhXhk01Jnjtl
 u5Mg2V55hcgJ0thQxBV/V28XyqmsEBrmAVi0Yf8Vr9Qbq4Ze08Wae4ChS4rEOyh1
 jTcGYWx5aJB3ChLvV/HI0nWQ3bkj03mMrL3SW8rhhf5DTyKHsVeTenpx42Qu/FKf
 fRzi09FSr3Pjd0B+EX6gunwJnlyXQC5Fa4AA0GhnXJzAznANXxHkkcXu8a6Yw75x
 e28CfhIBliORsK8sRHLoUnPpeTe1vtxCBhBMsE+gJAj9ZUOWMzvNFIPP4FvfawDy
 6cCQo2m1azJ/IdZZCDjFUWyjh+wxdKMp+NNryEcoV+VlqIoc3n98rFwrSL+GIq5Z
 WVwEwq+AcwoMCsD29Lu1ytL2PQ/RVqcJP5UheMrbL4vzefNfJFumQVZLIcX0k943
 8dFL2QHL+H/hM9Dx5y5rjeiWkAlq75v4xPKVjh/DHb4nehddCqn/+DD5HDhNANHf
 c1kmmEuYhvLpIaC4DHjE6DwLh8TPKahJjwsGuBOTr7D93NUQD+OOWsIhX6mNISIl
 FFhP8cd0/ZZVV//9j+q+5B4BaJsT+ZtwmrelKFnPdwPSnh+3iu8zPRRWO+8P8fRC
 YvddxuJAmE6BLmsAYrdz6Xb/wqfyV44cEiyivF0oBQfnhbtnXwDnkDWSfJD1bvCm
 ZwfpDh2+Tg==
 =LzyE
 -----END PGP SIGNATURE-----

Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Series of merge handling cleanups (Baolin, Christoph)

 - Series of blk-throttle fixes and cleanups (Baolin)

 - Series cleaning up BDI, seperating the block device from the
   backing_dev_info (Christoph)

 - Removal of bdget() as a generic API (Christoph)

 - Removal of blkdev_get() as a generic API (Christoph)

 - Cleanup of is-partition checks (Christoph)

 - Series reworking disk revalidation (Christoph)

 - Series cleaning up bio flags (Christoph)

 - bio crypt fixes (Eric)

 - IO stats inflight tweak (Gabriel)

 - blk-mq tags fixes (Hannes)

 - Buffer invalidation fixes (Jan)

 - Allow soft limits for zone append (Johannes)

 - Shared tag set improvements (John, Kashyap)

 - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

 - DM no-wait support (Mike, Konstantin)

 - Request allocation improvements (Ming)

 - Allow md/dm/bcache to use IO stat helpers (Song)

 - Series improving blk-iocost (Tejun)

 - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
   Xianting, Yang, Yufen, yangerkun)

* tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
  block: fix uapi blkzoned.h comments
  blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
  blk-mq: get rid of the dead flush handle code path
  block: get rid of unnecessary local variable
  block: fix comment and add lockdep assert
  blk-mq: use helper function to test hw stopped
  block: use helper function to test queue register
  block: remove redundant mq check
  block: invoke blk_mq_exit_sched no matter whether have .exit_sched
  percpu_ref: don't refer to ref->data if it isn't allocated
  block: ratelimit handle_bad_sector() message
  blk-throttle: Re-use the throtl_set_slice_end()
  blk-throttle: Open code __throtl_de/enqueue_tg()
  blk-throttle: Move service tree validation out of the throtl_rb_first()
  blk-throttle: Move the list operation after list validation
  blk-throttle: Fix IO hang for a corner case
  blk-throttle: Avoid tracking latency if low limit is invalid
  blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
  blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
  block: Remove redundant 'return' statement
  ...
2020-10-13 12:12:44 -07:00
Eric Biggers 07560151db block: make bio_crypt_clone() able to fail
bio_crypt_clone() assumes its gfp_mask argument always includes
__GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed.

However, bio_crypt_clone() might be called with GFP_ATOMIC via
setup_clone() in drivers/md/dm-rq.c, or with GFP_NOWAIT via
kcryptd_io_read() in drivers/md/dm-crypt.c.

Neither case is currently reachable with a bio that actually has an
encryption context.  However, it's fragile to rely on this.  Just make
bio_crypt_clone() able to fail, analogous to bio_integrity_clone().

Reported-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Satya Tangirala <satyat@google.com>
Cc: Satya Tangirala <satyat@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-05 10:47:43 -06:00
Ritesh Harjani 2cd896a5e8 block: Set same_page to false in __bio_try_merge_page if ret is false
If we hit the UINT_MAX limit of bio->bi_iter.bi_size and so we are anyway
not merging this page in this bio, then it make sense to make same_page
also as false before returning.

Without this patch, we hit below WARNING in iomap.
This mostly happens with very large memory system and / or after tweaking
vm dirty threshold params to delay writeback of dirty data.

WARNING: CPU: 18 PID: 5130 at fs/iomap/buffered-io.c:74 iomap_page_release+0x120/0x150
 CPU: 18 PID: 5130 Comm: fio Kdump: loaded Tainted: G        W         5.8.0-rc3 #6
 Call Trace:
  __remove_mapping+0x154/0x320 (unreliable)
  iomap_releasepage+0x80/0x180
  try_to_release_page+0x94/0xe0
  invalidate_inode_page+0xc8/0x110
  invalidate_mapping_pages+0x1dc/0x540
  generic_fadvise+0x3c8/0x450
  xfs_file_fadvise+0x2c/0xe0 [xfs]
  vfs_fadvise+0x3c/0x60
  ksys_fadvise64_64+0x68/0xe0
  sys_fadvise64+0x28/0x40
  system_call_exception+0xf8/0x1c0
  system_call_common+0xf0/0x278

Fixes: cc90bc6842 ("block: fix "check bi_size overflow before merge"")
Reported-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-09 08:18:45 -06:00
Matthew Wilcox (Oracle) d81665198b block: Fix page_is_mergeable() for compound pages
If we pass in an offset which is larger than PAGE_SIZE, then
page_is_mergeable() thinks it's not mergeable with the previous bio_vec,
leading to a large number of bio_vecs being used.  Use a slightly more
obvious test that the two pages are compatible with each other.

Fixes: 52d52d1c98 ("block: only allow contiguous page structs in a bio_vec")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-17 19:35:53 -07:00
Randy Dunlap 3cf1488917 block: bio: delete duplicated words
Drop the repeated words "a" and "the".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-31 16:29:47 -06:00
Christoph Hellwig ed00aabd5e block: rename generic_make_request to submit_bio_noacct
generic_make_request has always been very confusingly misnamed, so rename
it to submit_bio_noacct to make it clear that it is submit_bio minus
accounting and a few checks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 07:27:24 -06:00
Christoph Hellwig 28fc591ff9 block: move the bio cgroup associatation helpers to blk-cgroup.c
Keep the cgroup code together.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-29 09:09:08 -06:00
Christoph Hellwig a18b9b1590 block: move bio_associate_blkg_from_page to mm/page_io.c
bio_associate_blkg_from_page is a special purpose helper for swap bios
that doesn't need access to bio internals.  Move it to the swap code
instead of having it in bio.c.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-29 09:09:08 -06:00
Christoph Hellwig 2badf06cf9 block: merge __bio_associate_blkg into bio_associate_blkg_from_css
Merge __bio_associate_blkg into the only caller, which allows to slightly
reduce the RCU crticial section and better explain the code flow.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-29 09:09:08 -06:00
Christoph Hellwig d92c370a16 block: really clone the block cgroup in bio_clone_blkg_association
bio_clone_blkg_association is supposed to clone the associatation, but
actually ends up doing a search with a tryget.  As we know we have a
reference on the source cgroup just get an unconditional additional
reference to it and call it a day.  That also removes the need for
a RCU critical section.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-29 09:09:08 -06:00
Christoph Hellwig db9819c76c block: remove bio_disassociate_blkg
bio_disassociate_blkg has two callers, of which one immediately assigns
a new value to >bi_blkg.  Just open code the function in the two callers.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-29 09:09:08 -06:00
Gustavo A. R. Silva 1f4fe21cf4 block: bio: Use struct_size() in kmalloc()
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes.

This code was detected with the help of Coccinelle and, audited and
fixed manually.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Addresses-KSPP-ID: https://github.com/KSPP/linux/issues/83
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:15:58 -06:00
Christoph Hellwig d24de76af8 block: remove the error argument to the block_bio_complete tracepoint
The status can be trivially derived from the bio itself.  That also avoid
callers like NVMe to incorrectly pass a blk_status_t instead of the errno,
and the overhead of translating the blk_status_t to the errno in the I/O
completion fast path when no tracing is enabled.

Fixes: 35fe0d12c8 ("nvme: trace bio completion")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04 21:16:11 -06:00