Commit Graph

4101 Commits

Author SHA1 Message Date
Omar Sandoval f0a0cdddb1 kyber: fix integer overflow of latency targets on 32-bit
NSEC_PER_SEC has type long, so 5 * NSEC_PER_SEC is calculated as a long.
However, 5 seconds is 5,000,000,000 nanoseconds, which overflows a
32-bit long. Make sure all of the targets are calculated as 64-bit
values.

Fixes: 6e25cb01ea ("kyber: implement improved heuristics")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28 10:49:39 -06:00
Hannes Reinecke fef912bf86 block: genhd: add 'groups' argument to device_add_disk
Update device_add_disk() to take an 'groups' argument so that
individual drivers can register a device with additional sysfs
attributes.
This avoids race condition the driver would otherwise have if these
groups were to be created with sysfs_add_groups().

Signed-off-by: Martin Wilck <martin.wilck@suse.com>
Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28 08:30:28 -06:00
Omar Sandoval 6c3b7af1c9 kyber: add tracepoints
When debugging Kyber, it's really useful to know what latencies we've
been having, how the domain depths have been adjusted, and if we've
actually been throttling. Add three tracepoints, kyber_latency,
kyber_adjust, and kyber_throttled, to record that.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-27 17:34:59 -06:00
Omar Sandoval 6e25cb01ea kyber: implement improved heuristics
Kyber's current heuristics have a few flaws:

- It's based on the mean latency, but p99 latency tends to be more
  meaningful to anyone who cares about latency. The mean can also be
  skewed by rare outliers that the scheduler can't do anything about.
- The statistics calculations are purely time-based with a short window.
  This works for steady, high load, but is more sensitive to outliers
  with bursty workloads.
- It only considers the latency once an I/O has been submitted to the
  device, but the user cares about the time spent in the kernel, as
  well.

These are shortcomings of the generic blk-stat code which doesn't quite
fit the ideal use case for Kyber. So, this replaces the statistics with
a histogram used to calculate percentiles of total latency and I/O
latency, which we then use to adjust depths in a slightly more
intelligent manner:

- Sync and async writes are now the same domain.
- Discards are a separate domain.
- Domain queue depths are scaled by the ratio of the p99 total latency
  to the target latency (e.g., if the p99 latency is double the target
  latency, we will double the queue depth; if the p99 latency is half of
  the target latency, we can halve the queue depth).
- We use the I/O latency to determine whether we should scale queue
  depths down: we will only scale down if any domain's I/O latency
  exceeds the target latency, which is an indicator of congestion in the
  device.

These new heuristics are just as scalable as the heuristics they
replace.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-27 17:34:57 -06:00
Omar Sandoval fa2a1f609e kyber: don't make domain token sbitmap larger than necessary
The domain token sbitmaps are currently initialized to the device queue
depth or 256, whichever is larger, and immediately resized to the
maximum depth for that domain (256, 128, or 64 for read, write, and
other, respectively). The sbitmap is never resized larger than that, so
it's unnecessary to allocate a bitmap larger than the maximum depth.
Let's just allocate it to the maximum depth to begin with. This will use
marginally less memory, and more importantly, give us a more appropriate
number of bits per sbitmap word.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-27 17:34:56 -06:00
Omar Sandoval f8232f29ca block: export blk_stat_enable_accounting()
Kyber will need this in a future change if it is built as a module.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-27 17:34:54 -06:00
Omar Sandoval ed88660a53 block: move call of scheduler's ->completed_request() hook
Commit 4bc6339a58 ("block: move blk_stat_add() to
__blk_mq_end_request()") consolidated some calls using ktime_get() so
we'd only need to call it once. Kyber's ->completed_request() hook also
calls ktime_get(), so let's move it to the same place, too.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-27 17:34:52 -06:00
Ilya Dryomov 587562d0c7 blk-mq: I/O and timer unplugs are inverted in blktrace
trace_block_unplug() takes true for explicit unplugs and false for
implicit unplugs.  schedule() unplugs are implicit and should be
reported as timer unplugs.  While correct in the legacy code, this has
been inverted in blk-mq since 4.11.

Cc: stable@vger.kernel.org
Fixes: bd166ef183 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-27 13:12:44 -06:00
Damien Le Moal 854f31ccdd block: fix deadline elevator drain for zoned block devices
When the deadline scheduler is used with a zoned block device, writes
to a zone will be dispatched one at a time. This causes the warning
message:

deadline: forced dispatching is broken (nr_sorted=X), please report this

to be displayed when switching to another elevator with the legacy I/O
path while write requests to a zone are being retained in the scheduler
queue.

Prevent this message from being displayed when executing
elv_drain_elevator() for a zoned block device. __blk_drain_queue() will
loop until all writes are dispatched and completed, resulting in the
desired elevator queue drain without extensive modifications to the
deadline code itself to handle forced-dispatch calls.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Fixes: 8dc8146f9c ("deadline-iosched: Introduce zone locking support")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-26 19:57:24 -06:00
Bart Van Assche 986d413b7c blk-mq: Enable support for runtime power management
Now that the blk-mq core processes power management requests
(marked with RQF_PREEMPT) in other states than RPM_ACTIVE, enable
runtime power management for blk-mq.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-26 15:11:29 -06:00
Bart Van Assche 7cedffec8e block: Make blk_get_request() block for non-PM requests while suspended
Instead of allowing requests that are not power management requests
to enter the queue in runtime suspended status (RPM_SUSPENDED), make
the blk_get_request() caller block. This change fixes a starvation
issue: it is now guaranteed that power management requests will be
executed no matter how many blk_get_request() callers are waiting.
For blk-mq, instead of maintaining the q->nr_pending counter, rely
on q->q_usage_counter. Call pm_runtime_mark_last_busy() every time a
request finishes instead of only if the queue depth drops to zero.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-26 15:11:29 -06:00
Bart Van Assche bdd6316094 block: Allow unfreezing of a queue while requests are in progress
A later patch will call blk_freeze_queue_start() followed by
blk_mq_unfreeze_queue() without waiting for q_usage_counter to drop
to zero. Make sure that this doesn't cause a kernel warning to appear
by switching from percpu_ref_reinit() to percpu_ref_resurrect(). The
former namely requires that the refcount it operates on is zero.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-26 15:11:29 -06:00
Bart Van Assche 0d25bd072b block: Schedule runtime resume earlier
Instead of scheduling runtime resume of a request queue after a
request has been queued, schedule asynchronous resume during request
allocation. The new pm_request_resume() calls occur after
blk_queue_enter() has increased the q_usage_counter request queue
member. This change is needed for a later patch that will make request
allocation block while the queue status is not RPM_ACTIVE.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-26 15:11:28 -06:00
Bart Van Assche 154b00d566 block: Split blk_pm_add_request() and blk_pm_put_request()
Move the pm_request_resume() and pm_runtime_mark_last_busy() calls into
two new functions and thereby separate legacy block layer code from code
that works for both the legacy block layer and blk-mq. A later patch will
add calls to the new functions in the blk-mq code.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-26 15:11:28 -06:00
Bart Van Assche cd84a62e00 block, scsi: Change the preempt-only flag into a counter
The RQF_PREEMPT flag is used for three purposes:
- In the SCSI core, for making sure that power management requests
  are executed even if a device is in the "quiesced" state.
- For domain validation by SCSI drivers that use the parallel port.
- In the IDE driver, for IDE preempt requests.
Rename "preempt-only" into "pm-only" because the primary purpose of
this mode is power management. Since the power management core may
but does not have to resume a runtime suspended device before
performing system-wide suspend and since a later patch will set
"pm-only" mode as long as a block device is runtime suspended, make
it possible to set "pm-only" mode from more than one context. Since
with this change scsi_device_quiesce() is no longer idempotent, make
that function return early if it is called for a quiesced queue.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-26 15:11:28 -06:00
Bart Van Assche bca6b067b0 block: Move power management code into a new source file
Move the code for runtime power management from blk-core.c into the
new source file blk-pm.c. Move the corresponding declarations from
<linux/blkdev.h> into <linux/blk-pm.h>. For CONFIG_PM=n, leave out
the declarations of the functions that are not used in that mode.
This patch not only reduces the number of #ifdefs in the block layer
core code but also reduces the size of header file <linux/blkdev.h>
and hence should help to reduce the build time of the Linux kernel
if CONFIG_PM is not defined.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-26 15:11:28 -06:00
Christoph Hellwig c39ae60dfb block: remove ARCH_BIOVEC_PHYS_MERGEABLE
Take the Xen check into the core code instead of delegating it to
the architectures.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-26 08:45:11 -06:00
Keith Busch 530ca2c9bd blk-mq: Allow blocking queue tag iter callbacks
A recent commit runs tag iterator callbacks under the rcu read lock,
but existing callbacks do not satisfy the non-blocking requirement.
The commit intended to prevent an iterator from accessing a queue that's
being modified. This patch fixes the original issue by taking a queue
reference instead of reading it, which allows callbacks to make blocking
calls.

Fixes: f5bbbbe4d6 ("blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter")
Acked-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-25 20:17:59 -06:00
Christoph Hellwig 6e768461c2 block: remove bvec_to_phys
We only use it in biovec_phys_mergeable and a m68k paravirt driver,
so just opencode it there.  Also remove the pointless unsigned long cast
for the offset in the opencoded instances.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-24 12:33:59 -06:00
Christoph Hellwig 3dccdae54f block: merge BIOVEC_SEG_BOUNDARY into biovec_phys_mergeable
These two checks should always be performed together, so merge them into
a single helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-24 12:33:57 -06:00
Christoph Hellwig 0e253391a9 block: add a missing BIOVEC_SEG_BOUNDARY check in bio_add_pc_page
The actual recaculation of segments in __blk_recalc_rq_segments will
do this check, so there is no point in forcing it if we know it won't
succeed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-24 12:33:55 -06:00
Christoph Hellwig 6a9f5f240a block: simplify BIOVEC_PHYS_MERGEABLE
Turn the macro into an inline, move it to blk.h and simplify the
arch hooks a bit.

Also rename the function to biovec_phys_mergeable as there is no need
to shout.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-24 12:33:54 -06:00
Christoph Hellwig 27ca1d4ed0 block: move req_gap_back_merge to blk.h
No need to expose these helpers outside the block layer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-24 12:33:53 -06:00
Christoph Hellwig e9907009cb block: move req_gap_{back,front}_merge to blk-merge.c
Keep it close to the actual users instead of exposing the function to all
drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-24 12:33:51 -06:00
Christoph Hellwig 43b729bfe9 block: move integrity_req_gap_{back,front}_merge to blk.h
No need to expose these to drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-24 12:33:50 -06:00
Bart Van Assche c7b1bf5cca blk-mq: Document the functions that iterate over requests
Make it easier to understand the purpose of the functions that iterate
over requests by documenting their purpose. Fix several minor spelling
and grammer mistakes in comments in these functions.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:30:22 -06:00
Dennis Zhou (Facebook) 101246ec02 blkcg: rename blkg_try_get to blkg_tryget
blkg reference counting now uses percpu_ref rather than atomic_t. Let's
make this consistent with css_tryget. This renames blkg_try_get to
blkg_tryget and now returns a bool rather than the blkg or NULL.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:19 -06:00
Dennis Zhou (Facebook) b3b9f24f5f blkcg: change blkg reference counting to use percpu_ref
Now that every bio is associated with a blkg, this puts the use of
blkg_get, blkg_try_get, and blkg_put on the hot path. This switches over
the refcnt in blkg to use percpu_ref.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:18 -06:00
Dennis Zhou (Facebook) f0fcb3ec89 blkcg: remove additional reference to the css
The previous patch in this series removed carrying around a pointer to
the css in blkg. However, the blkg association logic still relied on
taking a reference on the css to ensure we wouldn't fail in getting a
reference for the blkg.

Here the implicit dependency on the css is removed. The association
continues to rely on the tryget logic walking up the blkg tree. This
streamlines the three ways that association can happen: normal, swap,
and writeback.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:15 -06:00
Dennis Zhou (Facebook) c839e7a03f blkcg: remove bio->bi_css and instead use bio->bi_blkg
Prior patches ensured that all bios are now associated with some blkg.
This now makes bio->bi_css unnecessary as blkg maintains a reference to
the blkcg already.

This patch removes the field bi_css and transfers corresponding uses to
access via bi_blkg.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:13 -06:00
Dennis Zhou (Facebook) 74b7c02a9b blkcg: associate a blkg for pages being evicted by swap
A prior patch in this series added blkg association to bios issued by
cgroups. There are two other paths that we want to attribute work back
to the appropriate cgroup: swap and writeback. Here we modify the way
swap tags bios to include the blkg. Writeback will be tackle in the next
patch.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:09 -06:00
Dennis Zhou (Facebook) 5bf9a1f3b4 blkcg: consolidate bio_issue_init to be a part of core
bio_issue_init among other things initializes the timestamp for an IO.
Rather than have this logic handled by policies, this consolidates it to
be on the init paths (normal, clone, bounce clone).

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:08 -06:00
Dennis Zhou (Facebook) a7b39b4e96 blkcg: always associate a bio with a blkg
Previously, blkg's were only assigned as needed by blk-iolatency and
blk-throttle. bio->css was also always being associated while blkg was
being looked up and then thrown away in blkcg_bio_issue_check.

This patch begins the cleanup of bio->css and bio->bi_blkg by always
associating a blkg in blkcg_bio_issue_check. This tries to create the
blkg, but if it is not possible, falls back to using the root_blkg of
the request_queue. Therefore, a bio will always be associated with a
blkg. The duplicate association logic is removed from blk-throttle and
blk-iolatency.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:06 -06:00
Dennis Zhou (Facebook) 07b05bcc32 blkcg: convert blkg_lookup_create to find closest blkg
There are several scenarios where blkg_lookup_create can fail. Examples
include the blkcg dying, request_queue is dying, or simply being OOM. At
the end of the day, most handle this by simply falling back to the
q->root_blkg and calling it a day.

This patch implements the notion of closest blkg. During
blkg_lookup_create, if it fails to create, return the closest blkg
found or the q->root_blkg. blkg_try_get_closest is introduced and used
during association so a bio is always attached to a blkg.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:05 -06:00
Dennis Zhou (Facebook) 49f4c2dc2b blkcg: update blkg_lookup_create to do locking
To know when to create a blkg, the general pattern is to do a
blkg_lookup and if that fails, lock and then do a lookup again and if
that fails finally create. It doesn't make much sense for everyone who
wants to do creation to write this themselves.

This changes blkg_lookup_create to do locking and implement this
pattern. The old blkg_lookup_create is renamed to __blkg_lookup_create.
If a call site wants to do its own error handling or already owns the
queue lock, they can use __blkg_lookup_create. This will be used in
upcoming patches.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:03 -06:00
Dennis Zhou (Facebook) 27e6fa996c blkcg: fix ref count issue with bio_blkcg using task_css
The accessor function bio_blkcg either returns the blkcg associated with
the bio or finds one in the current context. This can cause an issue
when trying to associate a bio with a blkcg. Particularly, it's the
third case that is problematic:

	return css_to_blkcg(task_css(current, io_cgrp_id));

As the above may race against task migration and the cgroup exiting, it
is not always ok to take a reference on the blkcg returned from
bio_blkcg.

This patch adds association ahead of calling bio_blkcg rather than
after. This makes association a required and explicit step along the
code paths for calling bio_blkcg. blk_get_rl is modified as well to get
a reference to the blkcg it may use and blk_put_rl will always put the
reference back. Association is also moved above the bio_blkcg call to
ensure it will not return NULL in blk-iolatency.

BFQ and CFQ utilize this flaw, but due to the complexity, I do not want
to address this in this series. I've created a private version of the
function with notes not to use it describing the flaw. Hopefully soon,
that code can be cleaned up.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:02 -06:00
Omar Sandoval b57e99b4b8 block: use nanosecond resolution for iostat
Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
updating properly on 4.18. This is because we started using ktime to
track elapsed time, and we convert nanoseconds to jiffies when we update
the partition counter. However, this gets rounded down, so any I/Os that
take less than a jiffy are not accounted for. Previously in this case,
the value of jiffies would sometimes increment while we were doing I/O,
so at least some I/Os were accounted for.

Let's convert the stats to use nanoseconds internally. We still report
milliseconds as before, now more accurately than ever. The value is
still truncated to 32 bits for backwards compatibility.

Fixes: 522a777566 ("block: consolidate struct request timestamp fields")
Cc: stable@vger.kernel.org
Reported-by: Klaus Kusche <klaus.kusche@computerix.info>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:26:59 -06:00
Liu Bo 9ff01255a0 Blk-throttle: update to use rbtree with leftmost node cached
As rbtree has native support of caching leftmost node,
i.e. rb_root_cached, no need to do the caching by ourselves.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-20 19:31:46 -06:00
Christoph Hellwig 576ed91354 block: use bio_add_page in bio_iov_iter_get_pages
Replace a nasty hack with a different nasty hack to prepare for multipage
bio_vecs.  By moving the temporary page array as far up as possible in
the space allocated for the bio_vec array we can iterate forward over it
and thus use bio_add_page.  Using bio_add_page means we'll be able to
merge physically contiguous pages once support for multipath bio_vecs is
merged.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-20 09:11:40 -06:00
Paolo Valente c8765de0ad blok, bfq: do not plug I/O if all queues are weight-raised
To reduce latency for interactive and soft real-time applications, bfq
privileges the bfq_queues containing the I/O of these
applications. These privileged queues, referred-to as weight-raised
queues, get a much higher share of the device throughput
w.r.t. non-privileged queues. To preserve this higher share, the I/O
of any non-weight-raised queue must be plugged whenever a sync
weight-raised queue, while being served, remains temporarily empty. To
attain this goal, bfq simply plugs any I/O (from any queue), if a sync
weight-raised queue remains empty while in service.

Unfortunately, this plugging typically lowers throughput with random
I/O, on devices with internal queueing (because it reduces the filling
level of the internal queues of the device).

This commit addresses this issue by restricting the cases where
plugging is performed: if a sync weight-raised queue remains empty
while in service, then I/O plugging is performed only if some of the
active bfq_queues are *not* weight-raised (which is actually the only
circumstance where plugging is needed to preserve the higher share of
the throughput of weight-raised queues). This restriction proved able
to boost throughput in really many use cases needing only maximum
throughput.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-14 13:06:05 -06:00
Paolo Valente d0edc2473b block, bfq: inject other-queue I/O into seeky idle queues on NCQ flash
The Achilles' heel of BFQ is its failing to reach a high throughput
with sync random I/O on flash storage with internal queueing, in case
the processes doing I/O have differentiated weights.

The cause of this failure is as follows. If at least two processes do
sync I/O, and have a different weight from each other, then BFQ plugs
I/O dispatching every time one of these processes, while it is being
served, remains temporarily without pending I/O requests. This
plugging is necessary to guarantee that every process enjoys a
bandwidth proportional to its weight; but it empties the internal
queue(s) of the drive. And this kills throughput with random I/O. So,
if some processes have differentiated weights and do both sync and
random I/O, the end result is a throughput collapse.

This commit tries to counter this problem by injecting the service of
other processes, in a controlled way, while the process in service
happens to have no I/O. This injection is performed only if the medium
is non rotational and performs internal queueing, and the process in
service does random I/O (service injection might be beneficial for
sequential I/O too, we'll work on that).

As an example of the benefits of this commit, on a PLEXTOR PX-256M5S
SSD, and with five processes having differentiated weights and doing
sync random 4KB I/O, this commit makes the throughput with bfq grow by
400%, from 25 to 100MB/s. This higher throughput is 10MB/s lower than
that reached with none. As some less random I/O is added to the mix,
the throughput becomes equal to or higher than that with none.

This commit is a very first attempt to recover throughput without
losing control, and certainly has many limitations. One is, e.g., that
the processes whose service is injected are not chosen so as to
distribute the extra bandwidth they receive in accordance to their
weights. Thus there might be loss of weighted fairness in some
cases. Anyway, this loss concerns extra service, which would not have
been received at all without this commit. Other limitations and issues
will probably show up with usage.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-14 13:06:03 -06:00
Paolo Valente cbeb869a3d block, bfq: correctly charge and reset entity service in all cases
BFQ schedules entities (which represent either per-process queues or
groups of queues) as a function of their timestamps. In particular, as
a function of their (virtual) finish times. The finish time of an
entity is computed as a function of the budget assigned to the entity,
assuming, tentatively, that the entity, once in service, will receive
an amount of service equal to its budget. Then, when the entity is
expired because it finishes to be served, this finish time is updated
as a function of the actual service received by the entity. This
allows the entity to be correctly charged with only the service
received, and then to be correctly re-scheduled.

Yet an entity may receive service also while not being the entity in
service (in the scheduling environment of its parent entity), for
several reasons. If the entity remains with no backlog while receiving
this 'unofficial' service, then it is expired. Also on such an
expiration, the finish time of the entity should be updated to account
for only the service actually received by the entity. Unfortunately,
such an update is not performed for an entity expiring without being
the entity in service.

In a similar vein, the service counter of the entity in service is
reset when the entity is expired, to be ready to be used for next
service cycle. This reset too should be performed also in case an
entity is expired because it remains empty after receiving service
while not being the entity in service. But in this case the reset is
not performed.

This commit performs the above update of the finish time and reset of
the service received, also for an entity expiring while not being the
entity in service.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-14 13:06:02 -06:00
YueHaibing f8c0d7b16f blk-iolatency: remove set but not used variables 'changed' and 'blkiolat'
Fixes gcc '-Wunused-but-set-variable' warning:

block/blk-iolatency.c: In function 'scale_change':
block/blk-iolatency.c:301:7: warning:
 variable 'changed' set but not used [-Wunused-but-set-variable]

block/blk-iolatency.c: In function 'iolatency_set_limit':
block/blk-iolatency.c:765:24: warning:
 variable 'blkiolat' set but not used [-Wunused-but-set-variable]

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-13 19:59:24 -06:00
Jens Axboe 01c5f85aeb blk-cgroup: increase number of supported policies
After merging the iolatency policy, we potentially now have 4 policies
being registered, but only support 3. This causes one of them to fail
loading. Takashi reports that BFQ no longer works for him, because it
fails to load due to policy registration failure.

Bump to 5 policies, and also add a warning for when we have exceeded
the global amount. If we have to touch this again, we should switch
to a dynamic scheme instead.

Reported-by: Takashi Iwai <tiwai@suse.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-11 10:59:53 -06:00
Ming Lei 7759eb23fd block: remove bio_rewind_iter()
It is pointed that bio_rewind_iter() is one very bad API[1]:

1) bio size may not be restored after rewinding

2) it causes some bogus change, such as 5151842b9d (block: reset
bi_iter.bi_done after splitting bio)

3) rewinding really makes things complicated wrt. bio splitting

4) unnecessary updating of .bi_done in fast path

[1] https://marc.info/?t=153549924200005&r=1&w=2

So this patch takes Kent's suggestion to restore one bio into its original
state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
given now bio_rewind_iter() is only used by bio integrity code.

Cc: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: Hannes Reinecke <hare@suse.com>
Suggested-by: Kent Overstreet <kent.overstreet@gmail.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-06 15:12:24 -06:00
Konstantin Khlebnikov d5274b3cd6 block: bfq: swap puts in bfqg_and_blkg_put
Fix trivial use-after-free. This could be last reference to bfqg.

Fixes: 8f9bebc33d ("block, bfq: access and cache blkg data only when safe")
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-06 11:32:58 -06:00
Mikulas Patocka 8b2ded1c94 block: don't warn when doing fsync on read-only devices
It is possible to call fsync on a read-only handle (for example, fsck.ext2
does it when doing read-only check), and this call results in kernel
warning.

The patch b089cfd95d ("block: don't warn for flush on read-only device")
attempted to disable the warning, but it is buggy and it doesn't
(op_is_flush tests flags, but bio_op strips off the flags).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 721c7fc701 ("block: fail op_is_write() requests to read-only partitions")
Cc: stable@vger.kernel.org	# 4.18
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-05 16:14:36 -06:00
Dennis Zhou (Facebook) 3111885015 blkcg: use tryget logic when associating a blkg with a bio
There is a very small change a bio gets caught up in a really
unfortunate race between a task migration, cgroup exiting, and itself
trying to associate with a blkg. This is due to css offlining being
performed after the css->refcnt is killed which triggers removal of
blkgs that reach their blkg->refcnt of 0.

To avoid this, association with a blkg should use tryget and fallback to
using the root_blkg.

Fixes: 08e18eab0c ("block: add bi_blkg to the bio for cgroups")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-31 14:48:58 -06:00
Dennis Zhou (Facebook) 59b57717ff blkcg: delay blkg destruction until after writeback has finished
Currently, blkcg destruction relies on a sequence of events:
  1. Destruction starts. blkcg_css_offline() is called and blkgs
     release their reference to the blkcg. This immediately destroys
     the cgwbs (writeback).
  2. With blkgs giving up their reference, the blkcg ref count should
     become zero and eventually call blkcg_css_free() which finally
     frees the blkcg.

Jiufei Xue reported that there is a race between blkcg_bio_issue_check()
and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent
on the completion of all writeback associated with the blkcg. A count of
the number of cgwbs is maintained and once that goes to zero, blkg
destruction can follow. This should prevent premature blkg destruction
related to writeback.

The new process for blkcg cleanup is as follows:
  1. Destruction starts. blkcg_css_offline() is called which offlines
     writeback. Blkg destruction is delayed on the cgwb_refcnt count to
     avoid punting potentially large amounts of outstanding writeback
     to root while maintaining any ongoing policies. Here, the base
     cgwb_refcnt is put back.
  2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called
     and handles destruction of blkgs. This is where the css reference
     held by each blkg is released.
  3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
     This finally frees the blkg.

It seems in the past blk-throttle didn't do the most understandable
things with taking data from a blkg while associating with current. So,
the simplification and unification of what blk-throttle is doing caused
this.

Fixes: 08e18eab0c ("block: add bi_blkg to the bio for cgroups")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-31 14:48:56 -06:00
Dennis Zhou (Facebook) 6b06546206 Revert "blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()"
This reverts commit 4c6994806f.

Destroying blkgs is tricky because of the nature of the relationship. A
blkg should go away when either a blkcg or a request_queue goes away.
However, blkg's pin the blkcg to ensure they remain valid. To break this
cycle, when a blkcg is offlined, blkgs put back their css ref. This
eventually lets css_free() get called which frees the blkcg.

The above commit (4c6994806f) breaks this order of events by trying to
destroy blkgs in css_free(). As the blkgs still hold references to the
blkcg, css_free() is never called.

The race between blkcg_bio_issue_check() and cgroup_rmdir() will be
addressed in the following patch by delaying destruction of a blkg until
all writeback associated with the blkcg has been finished.

Fixes: 4c6994806f ("blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-31 14:48:54 -06:00
John Pittman db193954ed block: bsg: move atomic_t ref_count variable to refcount API
Currently, variable ref_count within the bsg_device struct is of
type atomic_t.  For variables being used as reference counters,
the refcount API should be used instead of atomic.  The newer
refcount API works to prevent counter overflows and use-after-free
bugs.  So, move this varable from the atomic API to refcount,
potentially avoiding the issues mentioned.

Signed-off-by: John Pittman <jpittman@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-27 19:17:02 -06:00
Chengguang Xu 62d2a19407 block: remove unnecessary condition check
kmem_cache_destroy() can handle NULL pointer correctly, so there is
no need to check e->icq_cache before calling kmem_cache_destroy().

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-27 19:16:06 -06:00
Jens Axboe b0a84beb2e blk-wbt: remove dead code
We already note and mark discard and swap IO from bio_to_wbt_flags().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-27 13:32:12 -06:00
Jens Axboe 38cfb5a45e blk-wbt: improve waking of tasks
We have two potential issues:

1) After commit 2887e41b91, we only wake one process at the time when
   we finish an IO. We really want to wake up as many tasks as can
   queue IO. Before this commit, we woke up everyone, which could cause
   a thundering herd issue.

2) A task can potentially consume two wakeups, causing us to (in
   practice) miss a wakeup.

Fix both by providing our own wakeup function, which stops
__wake_up_common() from waking up more tasks if we fail to get a
queueing token. With the strict ordering we have on the wait list, this
wakes the right tasks and the right amount of tasks.

Based on a patch from Jianchao Wang <jianchao.w.wang@oracle.com>.

Tested-by: Agarwal, Anchal <anchalag@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-27 11:27:26 -06:00
Jens Axboe 061a542753 blk-wbt: abstract out end IO completion handler
Prep patch for calling the handler from a different context,
no functional changes in this patch.

Tested-by: Agarwal, Anchal <anchalag@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-27 11:27:24 -06:00
Jens Axboe c125311d96 blk-wbt: don't maintain inflight counts if disabled
A previous commit removed the ability to have per-rq flags. We used
those flags to maintain inflight counts. Since we don't have those
anymore, we have to always maintain inflight counts, even if wbt is
disabled. This is clearly suboptimal.

Add a queue quiesce around changing the wbt latency settings from sysfs
to work around this. With that, we can reliably put the enabled check in
our bio_to_wbt_flags(), since we know the WBT_TRACKED flag will be
consistent for the lifetime of the request.

Fixes: c1c80384c8 ("block: remove external dependency on wbt_flags")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-23 09:34:46 -06:00
Jens Axboe c45e6a037a blk-wbt: fix has-sleeper queueing check
We need to do this inside the loop as well, or we can allow new
IO to supersede previous IO.

Tested-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-22 15:07:32 -06:00
Jens Axboe b78820937b blk-wbt: use wq_has_sleeper() for wq active check
We need the memory barrier before checking the list head,
use the appropriate helper for this. The matching queue
side memory barrier is provided by set_current_state().

Tested-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-22 15:07:31 -06:00
Jens Axboe ffa358dcaa blk-wbt: move disable check into get_limit()
Check it in one place, instead of in multiple places.

Tested-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-22 15:07:31 -06:00
Linus Torvalds 5bed49adfe for-4.19/post-20180822
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlt9on8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpj1xEADKBmJlV9aVyxc5w6XggqAGeHqI4afFrl+v
 9fW6WUQMAaBUrr7PMIEJQ0Zm4B7KxgBaEWNtuuj4ULkjpgYm2AuGUuTJSyKz41rS
 Ma+KNyCA2Zmq4SvwGFbcdCuCbUqnoxTycscAgCjuDvIYLW0+nFGNc47ibmu9lZIV
 33Ef5LrxuCjhC2zyNxEdWpUxDCjoYzock85LW+wYyIYLU9uKdoExS+YmT8U+ebA/
 AkXBcxPztNDxwkcsIwgGVoTjwxiowqGz3uueWfyEmYgaCPiNOsxkoNQAtjX4ykQE
 hnqnHWyzJkRwbYo7Vd/bRAZXvszKGYE1YcJmu5QrNf0dK5MSq2o5OYJAEJWbucPj
 m0R2u7O9qbS2JEnxGrm5+oYJwBzwNY5/Lajr15WkljTqobKnqcvn/Hdgz/XdGtek
 0S1QHkkBsF7e+cax8sePWK+O3ilY7pl9CzyZKB/tJngl8A45Jv8xVojg0v3O7oS+
 zZib0rwWg/bwR/uN6uPCDcEsQusqL5YovB7m6NRVshwz6cV1zVNp2q+iOulk7KuC
 MprW4Du9CJf8HA19XtyJIG1XLstnuz+Exy+i5BiimUJ5InoEFDuj/6OZa6Qaczbo
 SrDDvpGtSf4h7czKpE5kV4uZiTOrjuI30TrI+4csdZ7HQIlboxNL72seNTLJs55F
 nbLjRM8L6g==
 =FS7e
 -----END PGP SIGNATURE-----

Merge tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block

Pull more block updates from Jens Axboe:

 - Set of bcache fixes and changes (Coly)

 - The flush warn fix (me)

 - Small series of BFQ fixes (Paolo)

 - wbt hang fix (Ming)

 - blktrace fix (Steven)

 - blk-mq hardware queue count update fix (Jianchao)

 - Various little fixes

* tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block: (31 commits)
  block/DAC960.c: make some arrays static const, shrinks object size
  blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
  blk-mq: init hctx sched after update ctx and hctx mapping
  block: remove duplicate initialization
  tracing/blktrace: Fix to allow setting same value
  pktcdvd: fix setting of 'ret' error return for a few cases
  block: change return type to bool
  block, bfq: return nbytes and not zero from struct cftype .write() method
  block, bfq: improve code of bfq_bfqq_charge_time
  block, bfq: reduce write overcharge
  block, bfq: always update the budget of an entity when needed
  block, bfq: readd missing reset of parent-entity service
  blk-wbt: fix IO hang in wbt_wait()
  block: don't warn for flush on read-only device
  bcache: add the missing comments for smp_mb()/smp_wmb()
  bcache: remove unnecessary space before ioctl function pointer arguments
  bcache: add missing SPDX header
  bcache: move open brace at end of function definitions to next line
  bcache: add static const prefix to char * array declarations
  bcache: fix code comments style
  ...
2018-08-22 13:38:05 -07:00
Jianchao Wang f5bbbbe4d6 blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
For blk-mq, part_in_flight/rw will invoke blk_mq_in_flight/rw to
account the inflight requests. It will access the queue_hw_ctx and
nr_hw_queues w/o any protection. When updating nr_hw_queues and
blk_mq_in_flight/rw occur concurrently, panic comes up.

Before update nr_hw_queues, the q will be frozen. So we could use
q_usage_counter to avoid the race. percpu_ref_is_zero is used here
so that we will not miss any in-flight request. The access to
nr_hw_queues and queue_hw_ctx in blk_mq_queue_tag_busy_iter are
under rcu critical section, __blk_mq_update_nr_hw_queues could use
synchronize_rcu to ensure the zeroed q_usage_counter to be globally
visible.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-21 09:02:56 -06:00
Jianchao Wang d48ece209f blk-mq: init hctx sched after update ctx and hctx mapping
Currently, when update nr_hw_queues, IO scheduler's init_hctx will
be invoked before the mapping between ctx and hctx is adapted
correctly by blk_mq_map_swqueue. The IO scheduler init_hctx (kyber)
may depend on this mapping and get wrong result and panic finally.
A simply way to fix this is that switch the IO scheduler to 'none'
before update the nr_hw_queues, and then switch it back after
update nr_hw_queues. blk_mq_sched_init_/exit_hctx are removed due
to nobody use them any more.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-21 09:02:55 -06:00
Chaitanya Kulkarni fcedba42d9 block: remove duplicate initialization
This patch removes the duplicate initialization of q->queue_head
in the blk_alloc_queue_node(). This removes the 2nd initialization
so that we preserve the initialization order same as declaration
present in struct request_queue.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-17 12:33:36 -06:00
Chengguang Xu 599d067dd3 block: change return type to bool
Because blk_do_io_stat() only does a judgement about the request
contributes to IO statistics, it better changes return type to bool.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-16 13:44:17 -06:00
Maciej S. Szmigiero fc8ebd01de block, bfq: return nbytes and not zero from struct cftype .write() method
The value that struct cftype .write() method returns is then directly
returned to userspace as the value returned by write() syscall, so it
should be the number of bytes actually written (or consumed) and not zero.

Returning zero from write() syscall makes programs like /bin/echo or bash
spin.

Signed-off-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Fixes: e21b7a0b98 ("block, bfq: add full hierarchical scheduling and cgroups support")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-16 13:11:16 -06:00
Paolo Valente f812164869 block, bfq: improve code of bfq_bfqq_charge_time
bfq_bfqq_charge_time contains some lengthy and redundant code. This
commit trims and condenses that code.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-16 13:08:15 -06:00
Paolo Valente d5801088a7 block, bfq: reduce write overcharge
When a sync request is dispatched, the queue that contains that
request, and all the ancestor entities of that queue, are charged with
the number of sectors of the request. In constrast, if the request is
async, then the queue and its ancestor entities are charged with the
number of sectors of the request, multiplied by an overcharge
factor. This throttles the bandwidth for async I/O, w.r.t. to sync
I/O, and it is done to counter the tendency of async writes to steal
I/O throughput to reads.

On the opposite end, the lower this parameter, the stabler I/O
control, in the following respect.  The lower this parameter is, the
less the bandwidth enjoyed by a group decreases
- when the group does writes, w.r.t. to when it does reads;
- when other groups do reads, w.r.t. to when they do writes.

The fixes "block, bfq: always update the budget of an entity when
needed" and "block, bfq: readd missing reset of parent-entity service"
improved I/O control in bfq to such an extent that it has been
possible to revise this overcharge factor downwards.  This commit
introduces the resulting, new value.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-16 13:08:13 -06:00
Paolo Valente e02a0aa26b block, bfq: always update the budget of an entity when needed
When the next child entity to serve changes for a given parent entity,
the budget of that parent entity must be updated accordingly.
Unfortunately, this update is not performed, by mistake, for the
entities that happen to switch from having no child entity to serve,
to having one child entity to serve.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-16 13:08:12 -06:00
Paolo Valente 8a511ba5fe block, bfq: readd missing reset of parent-entity service
The received-service counter needs to be equal to 0 when an entity is
set in service. Unfortunately, commit "block, bfq: fix service being
wrongly set to zero in case of preemption" mistakenly removed the
resetting of this counter for the parent entities of the bfq_queue
being set in service. This commit fixes this issue by resetting
service for parent entities, directly on the expiration of the
in-service bfq_queue.

Fixes: 9fae8dd59f ("block, bfq: fix service being wrongly set to zero in case of preemption")
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-16 13:08:10 -06:00
Linus Torvalds 73ba2fb33c for-4.19/block-20180812
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltwvasQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpv65EACTq5gSLnJBI6ZPr1RAHruVDnjfzO2Veitl
 tUtjm0XfWmnEiwQ3dYvnyhy99xbyaG3900d9BClCTlH6xaUdSiQkDpcKG/R2F36J
 5mZitYukQcpFAQJWF8YKsTTE7JPl4VglCIDqYiC4+C3rOSVi8lrKn2qp4J4MMCFn
 thRg3jCcq7c5s9Eigsop1pXWQSasubkXfk55Krcp4oybKYpYRKXXf74Mj14QAbwJ
 QHN3VisyAUWoBRg7UQZo1Npe2oPk6bbnJypnjf8M0M2EnlvddEkIlHob91sodka8
 6p4APOEu5cbyXOBCAQsw/koff14mb8aEadqeQA68WvXfIdX9ZjfxCX0OoC3sBEXk
 yqJhZ0C980AM13zIBD8ejv4uasGcPca8W+47mE5P8sRiI++5kBsFWDZPCtUBna0X
 2Kh24NsmEya9XRR5vsB84dsIPQ3tLMkxg/IgQRVDaSnfJz0c/+zm54xDyKRaFT4l
 5iERk2WSkm9+8jNfVmWG0edrv6nRAXjpGwFfOCPh6/LCSCi4xQRULYN7sVzsX8ZK
 FRjt24HftBI8mJbh4BtweJvg+ppVe1gAk3IO3HvxAQhv29Hz+uvFYe9kL+3N8LJA
 Qosr9n9O4+wKYizJcDnw+5iPqCHfAwOm9th4pyedR+R7SmNcP3yNC8AbbheNBiF5
 Zolos5H+JA==
 =b9ib
 -----END PGP SIGNATURE-----

Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "First pull request for this merge window, there will also be a
  followup request with some stragglers.

  This pull request contains:

   - Fix for a thundering heard issue in the wbt block code (Anchal
     Agarwal)

   - A few NVMe pull requests:
      * Improved tracepoints (Keith)
      * Larger inline data support for RDMA (Steve Wise)
      * RDMA setup/teardown fixes (Sagi)
      * Effects log suppor for NVMe target (Chaitanya Kulkarni)
      * Buffered IO suppor for NVMe target (Chaitanya Kulkarni)
      * TP4004 (ANA) support (Christoph)
      * Various NVMe fixes

   - Block io-latency controller support. Much needed support for
     properly containing block devices. (Josef)

   - Series improving how we handle sense information on the stack
     (Kees)

   - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

   - Zoned device support for null_blk (Matias)

   - AIX partition fixes (Mauricio Faria de Oliveira)

   - DIF checksum code made generic (Max Gurtovoy)

   - Add support for discard in iostats (Michael Callahan / Tejun)

   - Set of updates for BFQ (Paolo)

   - Removal of async write support for bsg (Christoph)

   - Bio page dirtying and clone fixups (Christoph)

   - Set of bcache fix/changes (via Coly)

   - Series improving blk-mq queue setup/teardown speed (Ming)

   - Series improving merging performance on blk-mq (Ming)

   - Lots of other fixes and cleanups from a slew of folks"

* tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
  blkcg: Make blkg_root_lookup() work for queues in bypass mode
  bcache: fix error setting writeback_rate through sysfs interface
  null_blk: add lock drop/acquire annotation
  Blk-throttle: reduce tail io latency when iops limit is enforced
  block: paride: pd: mark expected switch fall-throughs
  block: Ensure that a request queue is dissociated from the cgroup controller
  block: Introduce blk_exit_queue()
  blkcg: Introduce blkg_root_lookup()
  block: Remove two superfluous #include directives
  blk-mq: count the hctx as active before allocating tag
  block: bvec_nr_vecs() returns value for wrong slab
  bcache: trivial - remove tailing backslash in macro BTREE_FLAG
  bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
  bcache: set max writeback rate when I/O request is idle
  bcache: add code comments for bset.c
  bcache: fix mistaken comments in request.c
  bcache: fix mistaken code comments in bcache.h
  bcache: add a comment in super.c
  bcache: avoid unncessary cache prefetch bch_btree_node_get()
  bcache: display rate debug parameters to 0 when writeback is not running
  ...
2018-08-14 10:23:25 -07:00
Ming Lei df60f6e835 blk-wbt: fix IO hang in wbt_wait()
On wbt invariant is that if one IO is tracked via WBT_TRACKED, rqw->inflight
should be updated for tracking this IO.

But commit c1c80384c8 ("block: remove external dependency on wbt_flags")
forgets to remove the early handling of !rwb_enabled(rwb) inside wbt_wait(),
then the inflight counter may not be increased in wbt_wait(), but decreased
in wbt_done() for this kind of IO, so this counter may become negative, then
wbt_wait() may wait forever.

This patch fixes the report in the following link:

	https://marc.info/?l=linux-block&m=153221542021033&w=2

Fixes: c1c80384c8 ("block: remove external dependency on wbt_flags")
Cc: Josef Bacik <jbacik@fb.com>
Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-14 11:05:52 -06:00
Jens Axboe b089cfd95d block: don't warn for flush on read-only device
Don't warn for a flush issued to a read-only device. It's not strictly
a writable command, as it doesn't change any on-media data by itself.

Reported-by: Stefan Agner <stefan@agner.ch>
Fixes: 721c7fc701 ("block: fail op_is_write() requests to read-only partitions")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-14 10:52:40 -06:00
Bart Van Assche b86d865cb1 blkcg: Make blkg_root_lookup() work for queues in bypass mode
For legacy queues the only call of blkg_root_lookup() happens after
bypass mode has been enabled. Since blkg_lookup() returns NULL for
queues in bypass mode, modify the blkg_root_lookup() such that it
no longer depends on bypass mode. Rename the function into
blk_queue_root_blkg() as suggested by Tejun.

Suggested-by: Tejun Heo <tj@kernel.org>
Fixes: 6bad9b210a ("blkcg: Introduce blkg_root_lookup()")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:41:25 -06:00
Liu Bo 991f61fe7e Blk-throttle: reduce tail io latency when iops limit is enforced
When an application's iops has exceeded its cgroup's iops limit, surely it
is throttled and kernel will set a timer for dispatching, thus IO latency
includes the delay.

However, the dispatch delay which is calculated by the limit and the
elapsed jiffies is suboptimal.  As the dispatch delay is only calculated
once the application's iops is (iops limit + 1), it doesn't need to wait
any longer than the remaining time of the current slice.

The difference can be proved by the following fio job and cgroup iops
setting,
-----
$ echo 4 > /mnt/config/nullb/disk1/mbps    # limit nullb's bandwidth to 4MB/s for testing.
$ echo "253:1 riops=100 rbps=max" > /sys/fs/cgroup/unified/cg1/io.max
$ cat r2.job
[global]
name=fio-rand-read
filename=/dev/nullb1
rw=randread
bs=4k
direct=1
numjobs=1
time_based=1
runtime=60
group_reporting=1

[file1]
size=4G
ioengine=libaio
iodepth=1
rate_iops=50000
norandommap=1
thinktime=4ms
-----

wo patch:
file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7-66-gedfc
Starting 1 process

   read: IOPS=99, BW=400KiB/s (410kB/s)(23.4MiB/60001msec)
    slat (usec): min=10, max=336, avg=27.71, stdev=17.82
    clat (usec): min=2, max=28887, avg=5929.81, stdev=7374.29
     lat (usec): min=24, max=28901, avg=5958.73, stdev=7366.22
    clat percentiles (usec):
     |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    6], 60.00th=[11731],
     | 70.00th=[11863], 80.00th=[11994], 90.00th=[12911], 95.00th=[22676],
     | 99.00th=[23725], 99.50th=[23987], 99.90th=[23987], 99.95th=[25035],
     | 99.99th=[28967]

w/ patch:
file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7-66-gedfc
Starting 1 process

   read: IOPS=100, BW=400KiB/s (410kB/s)(23.4MiB/60005msec)
    slat (usec): min=10, max=155, avg=23.24, stdev=16.79
    clat (usec): min=2, max=12393, avg=5961.58, stdev=5959.25
     lat (usec): min=23, max=12412, avg=5985.91, stdev=5951.92
    clat percentiles (usec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    4], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    5], 50.00th=[   47], 60.00th=[11863],
     | 70.00th=[11994], 80.00th=[11994], 90.00th=[11994], 95.00th=[11994],
     | 99.00th=[11994], 99.50th=[11994], 99.90th=[12125], 99.95th=[12125],
     | 99.99th=[12387]

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 12:43:16 -06:00
Bart Van Assche 24ecc35853 block: Ensure that a request queue is dissociated from the cgroup controller
Several block drivers call alloc_disk() followed by put_disk() if
something fails before device_add_disk() is called without calling
blk_cleanup_queue(). Make sure that also for this scenario a request
queue is dissociated from the cgroup controller. This patch avoids
that loading the parport_pc, paride and pf drivers triggers the
following kernel crash:

BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride]
Read of size 4 at addr 0000000000000008 by task modprobe/744
Call Trace:
dump_stack+0x9a/0xeb
kasan_report+0x139/0x350
pi_init+0x42e/0x580 [paride]
pf_init+0x2bb/0x1000 [pf]
do_one_initcall+0x8e/0x405
do_init_module+0xd9/0x2f2
load_module+0x3ab4/0x4700
SYSC_finit_module+0x176/0x1a0
do_syscall_64+0xee/0x2b0
entry_SYSCALL_64_after_hwframe+0x42/0xb7

Reported-by: Alexandru Moise <00moses.alexander00@gmail.com>
Fixes: a063057d7c ("block: Fix a race between request queue removal and the block cgroup controller") # v4.17
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Alexandru Moise <00moses.alexander00@gmail.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Alexandru Moise <00moses.alexander00@gmail.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 09:13:00 -06:00
Bart Van Assche 4cf6324b17 block: Introduce blk_exit_queue()
This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Alexandru Moise <00moses.alexander00@gmail.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 09:12:59 -06:00
Jianchao Wang d263ed9926 blk-mq: count the hctx as active before allocating tag
Currently, we count the hctx as active after allocate driver tag
successfully. If a previously inactive hctx try to get tag first
time, it may fails and need to wait. However, due to the stale tag
->active_queues, the other shared-tags users are still able to
occupy all driver tags while there is someone waiting for tag.
Consequently, even if the previously inactive hctx is waked up, it
still may not be able to get a tag and could be starved.

To fix it, we count the hctx as active before try to allocate driver
tag, then when it is waiting the tag, the other shared-tag users
will reserve budget for it.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:34:17 -06:00
Greg Edwards d6c02a9beb block: bvec_nr_vecs() returns value for wrong slab
In commit ed996a52c8 ("block: simplify and cleanup bvec pool
handling"), the value of the slab index is incremented by one in
bvec_alloc() after the allocation is done to indicate an index value of
0 does not need to be later freed.

bvec_nr_vecs() was not updated accordingly, and thus returns the wrong
value.  Decrement idx before performing the lookup.

Fixes: ed996a52c8 ("block: simplify and cleanup bvec pool handling")
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:25:04 -06:00
Bart Van Assche f7ecb1b109 cfq: Suppress compiler warnings about comparisons
This patch does not change any functionality but avoids that gcc
reports the following warnings when building with W=1:

block/cfq-iosched.c: In function ?cfq_back_seek_max_store?:
block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
  if (__data < (MIN))      \
             ^
block/cfq-iosched.c:4756:1: note: in expansion of macro ?STORE_FUNCTION?
 STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
 ^~~~~~~~~~~~~~
block/cfq-iosched.c: In function ?cfq_slice_idle_store?:
block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
  if (__data < (MIN))      \
             ^
block/cfq-iosched.c:4759:1: note: in expansion of macro ?STORE_FUNCTION?
 STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
 ^~~~~~~~~~~~~~
block/cfq-iosched.c: In function ?cfq_group_idle_store?:
block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
  if (__data < (MIN))      \
             ^
block/cfq-iosched.c:4760:1: note: in expansion of macro ?STORE_FUNCTION?
 STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
 ^~~~~~~~~~~~~~
block/cfq-iosched.c: In function ?cfq_low_latency_store?:
block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
  if (__data < (MIN))      \
             ^
block/cfq-iosched.c:4765:1: note: in expansion of macro ?STORE_FUNCTION?
 STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0);
 ^~~~~~~~~~~~~~
block/cfq-iosched.c: In function ?cfq_slice_idle_us_store?:
block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
  if (__data < (MIN))      \
             ^
block/cfq-iosched.c:4782:1: note: in expansion of macro ?USEC_STORE_FUNCTION?
 USEC_STORE_FUNCTION(cfq_slice_idle_us_store, &cfqd->cfq_slice_idle, 0, UINT_MAX);
 ^~~~~~~~~~~~~~~~~~~
block/cfq-iosched.c: In function ?cfq_group_idle_us_store?:
block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
  if (__data < (MIN))      \
             ^
block/cfq-iosched.c:4783:1: note: in expansion of macro ?USEC_STORE_FUNCTION?
 USEC_STORE_FUNCTION(cfq_group_idle_us_store, &cfqd->cfq_group_idle, 0, UINT_MAX);
 ^~~~~~~~~~~~~~~~~~~

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-07 17:57:13 -06:00
Bart Van Assche 9b4f43460d cfq: Annotate fall-through in a switch statement
This patch avoids that gcc complains about fall-through when building
with W=1.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-07 17:57:11 -06:00
Anchal Agarwal 2887e41b91 blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
I am currently running a large bare metal instance (i3.metal)
on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
4.18 kernel. I have a workload that simulates a database
workload and I am running into lockup issues when writeback
throttling is enabled,with the hung task detector also
kicking in.

Crash dumps show that most CPUs (up to 50 of them) are
all trying to get the wbt wait queue lock while trying to add
themselves to it in __wbt_wait (see stack traces below).

[    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
[    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
[    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
[    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
[    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
[    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
[    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
[    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
[    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
[    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
[    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
[    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.948138] Call Trace:
[    0.948139]  <IRQ>
[    0.948142]  do_raw_spin_lock+0xad/0xc0
[    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
[    0.948149]  ? __wake_up_common_lock+0x53/0x90
[    0.948150]  __wake_up_common_lock+0x53/0x90
[    0.948155]  wbt_done+0x7b/0xa0
[    0.948158]  blk_mq_free_request+0xb7/0x110
[    0.948161]  __blk_mq_complete_request+0xcb/0x140
[    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
[    0.948169]  nvme_irq+0x23/0x50 [nvme]
[    0.948173]  __handle_irq_event_percpu+0x46/0x300
[    0.948176]  handle_irq_event_percpu+0x20/0x50
[    0.948179]  handle_irq_event+0x34/0x60
[    0.948181]  handle_edge_irq+0x77/0x190
[    0.948185]  handle_irq+0xaf/0x120
[    0.948188]  do_IRQ+0x53/0x110
[    0.948191]  common_interrupt+0x87/0x87
[    0.948192]  </IRQ>
....
[    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
[    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
[    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
[    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
[    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
[    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
[    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
[    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
[    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
[    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
[    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
[    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
[    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.311154] Call Trace:
[    0.311157]  do_raw_spin_lock+0xad/0xc0
[    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
[    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
[    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
[    0.311167]  wbt_wait+0x127/0x330
[    0.311169]  ? finish_wait+0x80/0x80
[    0.311172]  ? generic_make_request+0xda/0x3b0
[    0.311174]  blk_mq_make_request+0xd6/0x7b0
[    0.311176]  ? blk_queue_enter+0x24/0x260
[    0.311178]  ? generic_make_request+0xda/0x3b0
[    0.311181]  generic_make_request+0x10c/0x3b0
[    0.311183]  ? submit_bio+0x5c/0x110
[    0.311185]  submit_bio+0x5c/0x110
[    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
[    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
[    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
[    0.311229]  ? do_writepages+0x3c/0xd0
[    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
[    0.311240]  do_writepages+0x3c/0xd0
[    0.311243]  ? _raw_spin_unlock+0x24/0x30
[    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
[    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
[    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
[    0.311253]  file_write_and_wait_range+0x34/0x90
[    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
[    0.311267]  do_fsync+0x38/0x60
[    0.311270]  SyS_fsync+0xc/0x10
[    0.311272]  do_syscall_64+0x6f/0x170
[    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7

In the original patch, wbt_done is waking up all the exclusive
processes in the wait queue, which can cause a thundering herd
if there is a large number of writer threads in the queue. The
original intention of the code seems to be to wake up one thread
only however, it uses wake_up_all() in __wbt_done(), and then
uses the following check in __wbt_wait to have only one thread
actually get out of the wait loop:

if (waitqueue_active(&rqw->wait) &&
            rqw->wait.head.next != &wait->entry)
                return false;

The problem with this is that the wait entry in wbt_wait is
define with DEFINE_WAIT, which uses the autoremove wakeup function.
That means that the above check is invalid - the wait entry will
have been removed from the queue already by the time we hit the
check in the loop.

Secondly, auto-removing the wait entries also means that the wait
queue essentially gets reordered "randomly" (e.g. threads re-add
themselves in the order they got to run after being woken up).
Additionally, new requests entering wbt_wait might overtake requests
that were queued earlier, because the wait queue will be
(temporarily) empty after the wake_up_all, so the waitqueue_active
check will not stop them. This can cause certain threads to starve
under high load.

The fix is to leave the woken up requests in the queue and remove
them in finish_wait() once the current thread breaks out of the
wait loop in __wbt_wait. This will ensure new requests always
end up at the back of the queue, and they won't overtake requests
that are already in the wait queue. With that change, the loop
in wbt_wait is also in line with many other wait loops in the kernel.
Waking up just one thread drastically reduces lock contention, as
does moving the wait queue add/remove out of the loop.

A significant drop in lockdep's lock contention numbers is seen when
running the test application on the patched kernel.

Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-07 14:40:49 -06:00
Jens Axboe 05b9ba4b55 Linux 4.18-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltU8z0eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG5X8H/2fJr7m3k242+t76
 sitwvx1eoPqTgryW59dRKm9IuXAGA+AjauvHzaz1QxomeQa50JghGWefD0eiJfkA
 1AphQ/24EOiAbbVk084dAI/C2p122dE4D5Fy7CrfLnuouyrbFaZI5STbnrRct7sR
 9deeYW0GDHO1Uenp4WDCj0baaqJqaevZ+7GG09DnWpya2nQtSkGBjqn6GpYmrfOU
 mqFuxAX8mEOW6cwK16y/vYtnVjuuMAiZ63/OJ8AQ6d6ArGLwAsdn7f8Fn4I4tEr2
 L0d3CRLUyegms4++Dmlu05k64buQu46WlPhjCZc5/Ts4kjrNxBuHejj2/jeSnUSt
 vJJlibI=
 =42a5
 -----END PGP SIGNATURE-----

Merge tag 'v4.18-rc6' into for-4.19/block2

Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a
merge conflict down the line.

Signed-of-by: Jens Axboe <axboe@kernel.dk>
2018-08-05 19:32:09 -06:00
Linus Torvalds a32e236eb9 Partially revert "block: fail op_is_write() requests to read-only partitions"
It turns out that commit 721c7fc701 ("block: fail op_is_write()
requests to read-only partitions"), while obviously correct, causes
problems for some older lvm2 installations.

The reason is that the lvm snapshotting will continue to write to the
snapshow COW volume, even after the volume has been marked read-only.
End result: snapshot failure.

This has actually been fixed in newer version of the lvm2 tool, but the
old tools still exist, and the breakage was reported both in the kernel
bugzilla and in the Debian bugzilla:

  https://bugzilla.kernel.org/show_bug.cgi?id=200439
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900442

The lvm2 fix is here

  https://sourceware.org/git/?p=lvm2.git;a=commit;h=a6fdb9d9d70f51c49ad11a87ab4243344e6701a3

but until everybody has updated to recent versions, we'll have to weaken
the "never write to read-only partitions" check.  It now allows the
write to happen, but causes a warning, something like this:

  generic_make_request: Trying to write to read-only block-device dm-3 (partno X)
  Modules linked in: nf_tables xt_cgroup xt_owner kvm_intel iwlmvm kvm irqbypass iwlwifi
  CPU: 1 PID: 77 Comm: kworker/1:1 Not tainted 4.17.9-gentoo #3
  Hardware name: LENOVO 20B6A019RT/20B6A019RT, BIOS GJET91WW (2.41 ) 09/21/2016
  Workqueue: ksnaphd do_metadata
  RIP: 0010:generic_make_request_checks+0x4ac/0x600
  ...
  Call Trace:
   generic_make_request+0x64/0x400
   submit_bio+0x6c/0x140
   dispatch_io+0x287/0x430
   sync_io+0xc3/0x120
   dm_io+0x1f8/0x220
   do_metadata+0x1d/0x30
   process_one_work+0x1b9/0x3e0
   worker_thread+0x2b/0x3c0
   kthread+0x113/0x130
   ret_from_fork+0x35/0x40

Note that this is a "revert" in behavior only.  I'm leaving alone the
actual code cleanups in commit 721c7fc701, but letting the previously
uncaught request go through with a warning instead of stopping it.

Fixes: 721c7fc701 ("block: fail op_is_write() requests to read-only partitions")
Reported-and-tested-by: WGH <wgh@torlan.ru>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-04 18:19:55 -07:00
Linus Torvalds 71abe04269 for-linus-20180803
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltkcI4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpuByD/9Pf8mfsv/WDJposa5pRqnY18dtBNdfHjsb
 t6vuBZiLxe86LDo1Y3e7663u+kgraARyswUpNTv5iAKseVkwIcz0RTArQJkaDLy7
 AD2BEfLvpcacAoWZ96MiayjQabGXEnTWqiKnIlcbc6J4/LIfdYYZHd+SXUXH0R81
 BaXPANL222srxuGbBifsfZ/yXxNPmBYfT1mTaFU6cKAftVO/24RV/MbwkU4cgym7
 0ZM/iBDkFDy6eV3XhhGeX1miDhGa2zNM/NHvsnSsP8jQqlyJW+lpI5dRPEoH1Fzp
 ecpxSH/HtjW5GYMwP7qhHMth/XHhjkOmNDRblkRWYrMGxgy7mAcSR8axdjqTdDxD
 +vlez+b9uPF3qXUk1D2NkW3RmTSkxjV6ztzc+yddbFSzqmQcHSqH+7ohaHnQRefX
 IGngTOq5pSmluNmrul48y/wmkSwVI1/jrteOfJfhJXWL0pY65g+8KWQxLmeZHke7
 ytFu6msOsVmZjDQ7tckSxSLMB+frvzuc/h+eOBTx8ida0bjopVlfNUcLQRD4kZfy
 xHn6nPxGsdiq2PM71nxYvqoUfmZnIE3o0A9+IB4OIp5bLVAGt2/fbWV3llunGK0A
 JM8UmyOysh71xh288VyE/hPz/7tyea/KgTaTm0jdTkTeoHH2isHsgWWvYB3zjeEn
 /1c7o2T13Q==
 =Ml3z
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180803' of git://git.kernel.dk/linux-block

Pull block fix from Jens Axboe:
 "Just a single fix, from Ming, fixing a regression in this cycle where
  the busy tag iteration was changed to only calling the callback
  function for requests that are started. We really want all non-free
  requests.

  This fixes a boot regression on certain VM setups"

* tag 'for-linus-20180803' of git://git.kernel.dk/linux-block:
  blk-mq: fix blk_mq_tagset_busy_iter
2018-08-03 10:43:56 -07:00
Ming Lei 2d5ba0e2de blk-mq: fix blk_mq_tagset_busy_iter
Commit d250bf4e776ff09d5("blk-mq: only iterate over inflight requests
in blk_mq_tagset_busy_iter") uses 'blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT'
to replace 'blk_mq_request_started(req)', this way is wrong, and causes
lots of test system hang during booting.

Fix the issue by using blk_mq_request_started(req) inside bt_tags_iter().

Fixes: d250bf4e77 ("blk-mq: only iterate over inflight requests in blk_mq_tagset_busy_iter")
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Mark Brown <broonie@kernel.org>
Cc: Matt Hart <matthew.hart@linaro.org>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: John Garry <john.garry@huawei.com>
Cc: Hannes Reinecke <hare@suse.com>,
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: linux-scsi@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reported-by: Mark Brown <broonie@kernel.org>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-02 14:47:20 -06:00
Ming Lei 75d6e175fc blk-mq: fix updating tags depth
The passed 'nr' from userspace represents the total depth, meantime
inside 'struct blk_mq_tags', 'nr_tags' stores the total tag depth,
and 'nr_reserved_tags' stores the reserved part.

There are two issues in blk_mq_tag_update_depth() now:

1) for growing tags, we should have used the passed 'nr', and keep the
number of reserved tags not changed.

2) the passed 'nr' should have been used for checking against
'tags->nr_tags', instead of number of the normal part.

This patch fixes the above two cases, and avoids kernel crash caused
by wrong resizing sbitmap queue.

Cc: "Ewan D. Milne" <emilne@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Omar Sandoval <osandov@fb.com>
Tested by: Marco Patalano <mpatalan@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-02 14:41:58 -06:00
Ming Lei b233f12704 block: really disable runtime-pm for blk-mq
Runtime PM isn't ready for blk-mq yet, and commit 765e40b675 ("block:
disable runtime-pm for blk-mq") tried to disable it. Unfortunately,
it can't take effect in that way since user space still can switch
it on via 'echo auto > /sys/block/sdN/device/power/control'.

This patch disables runtime-pm for blk-mq really by pm_runtime_disable()
and fixes all kinds of PM related kernel crash.

Cc: Tomas Janousek <tomi@nomi.cz>
Cc: Przemek Socha <soprwa@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: <stable@vger.kernel.org>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-02 10:36:02 -06:00
Dennis Zhou (Facebook) c480bcf97b block: make iolatency avg_lat exponentially decay
Currently, avg_lat is calculated by accumulating the mean of every
window in a long running cumulative average. As time goes on, the metric
becomes less and less useful due to the accumulated history.

This patch reuses the same calculation done in load averages to make the
avg_lat metric more lively. Unlike load averages, the avg only advances
when a window elapses (due to an io). Idle periods extend the most
recent window. Bucketing is used to limit the history of avg_lat by
binding it to the window size. So, the window range for 1/exp (decay
rate) is [1 min, 2.5 min) when windows elapse immediately.

The current sample window size is exposed in the debug info to enable
calculation of the window range.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-02 09:58:14 -06:00
Josef Bacik cc7ecc2585 blk-cgroup: hold the queue ref during throttling
The blkg lifetime is protected by the queue lifetime, so we need to put
the queue _after_ we're done using the blkg.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-01 09:16:03 -06:00
Josef Bacik 52a1199ccd blk-iolatency: fix blkg leak in timer_fn
At this point we have a ref on the blkg, we need to drop it if we don't
have a iolat.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-01 09:16:01 -06:00
zhong jiang 4725549192 block/bsg-lib: use PTR_ERR_OR_ZERO to simplify the flow path
Simplify the code by using the PTR_ERR_OR_ZERO, instead of the
open code. It is better.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-01 09:13:03 -06:00
xiao jin 54648cf1ec block: blk_init_allocated_queue() set q->fq as NULL in the fail case
We find the memory use-after-free issue in __blk_drain_queue()
on the kernel 4.14. After read the latest kernel 4.18-rc6 we
think it has the same problem.

Memory is allocated for q->fq in the blk_init_allocated_queue().
If the elevator init function called with error return, it will
run into the fail case to free the q->fq.

Then the __blk_drain_queue() uses the same memory after the free
of the q->fq, it will lead to the unpredictable event.

The patch is to set q->fq as NULL in the fail case of
blk_init_allocated_queue().

Fixes: commit 7c94e1c157 ("block: introduce blk_flush_queue to drive flush machinery")
Cc: <stable@vger.kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: xiao jin <jin.xiao@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-30 08:28:39 -06:00
Max Gurtovoy 10c41ddd61 block: move dif_prepare/dif_complete functions to block layer
Currently these functions are implemented in the scsi layer, but their
actual place should be the block layer since T10-PI is a general data
integrity feature that is used in the nvme protocol as well. Also, use
the tuple size from the integrity profile since it may vary between
integrity types.

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-30 08:27:02 -06:00
Linus Torvalds eb181a814c for-linus-20180727
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltbc20QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgh4D/9GYQcjk9qLVFxkv5ucAUvCuxEL6gjsMf4W
 M/QdxVIrwh3zpvsH++2IXXn+xH+UjujMA5NkzhsSr4+hsSO2iAGOYMJbroNfhsTD
 onvQQ6NTaHPu/+PZs0otVK4KMWHwZGWOV6YU00TWTfRgzRmGEsSMe91oeBIXVv9w
 v6d09twaLSY0lUkAAbcdu5fuFBtXu4Bxy60qyHEKkAdWWHEUYaZLrODhVjoGg2V4
 KdAWS5X4A6kJMcPcoOvG6RFtpf71boaip9o/DRLUWhGdIQnI38UgSCUmz1XMYnik
 Sq8r74vqCm8IhIOLTlxnPrMHHbKv7JZhY3Ow9fxnS6HZRNI0aPX31Yml6NULqnWh
 MsQh+6gZXd3xC1O7txEQn4a15Lk0OLXa8HJcIn5ADNxqz5/r/g0mPUG9HmPSIalO
 ISFF/9UKQFcAd0RjHR+bEEH2VMznz59UWKfdOsmwFZtZSCmR1ucj0xAKDj+oP1JS
 ZsgZ09K2GezrL4GEueocISo9ACIWgDWH8T7/bTxlBok0IYbybAfmOe+MZInL1Tf4
 pklmoXm3ntgV3Pq8Ptk05LYyIgAaUIltuSiR3AFaXIADX0wNtV0ZgysIWgHf3BSA
 18j+I1yPG1IwBdM8xNwxi56xMQR84uY5tsIyafbfj+laRI2nH5OIYjNZnrKpm957
 4xZUgIECBA==
 =2ogY
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180727' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Bigger than usual at this time, mostly due to the O_DIRECT corruption
  issue and the fact that I was on vacation last week. This contains:

   - NVMe pull request with two fixes for the FC code, and two target
     fixes (Christoph)

   - a DIF bio reset iteration fix (Greg Edwards)

   - two nbd reply and requeue fixes (Josef)

   - SCSI timeout fixup (Keith)

   - a small series that fixes an issue with bio_iov_iter_get_pages(),
     which ended up causing corruption for larger sized O_DIRECT writes
     that ended up racing with buffered writes (Martin Wilck)"

* tag 'for-linus-20180727' of git://git.kernel.dk/linux-block:
  block: reset bi_iter.bi_done after splitting bio
  block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
  blkdev: __blkdev_direct_IO_simple: fix leak in error case
  block: bio_iov_iter_get_pages: fix size of last iovec
  nvmet: only check for filebacking on -ENOTBLK
  nvmet: fixup crash on NULL device path
  scsi: set timed out out mq requests to complete
  blk-mq: export setting request completion state
  nvme: if_ready checks to fail io to deleting controller
  nvmet-fc: fix target sgl list on large transfers
  nbd: handle unexpected replies better
  nbd: don't requeue the same request twice.
2018-07-27 12:51:00 -07:00
Mauricio Faria de Oliveira d43fdae7ba partitions/aix: append null character to print data from disk
Even if properly initialized, the lvname array (i.e., strings)
is read from disk, and might contain corrupt data (e.g., lack
the null terminating character for strings).

So, make sure the partition name string used in pr_warn() has
the null terminating character.

Fixes: 6ceea22bbb ("partitions: add aix lvm partition support files")
Suggested-by: Daniel J. Axtens <daniel.axtens@canonical.com>
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27 09:17:41 -06:00
Mauricio Faria de Oliveira 14cb2c8a6c partitions/aix: fix usage of uninitialized lv_info and lvname structures
The if-block that sets a successful return value in aix_partition()
uses 'lvip[].pps_per_lv' and 'n[].name' potentially uninitialized.

For example, if 'numlvs' is zero or alloc_lvn() fails, neither is
initialized, but are used anyway if alloc_pvd() succeeds after it.

So, make the alloc_pvd() call conditional on their initialization.

This has been hit when attaching an apparently corrupted/stressed
AIX LUN, misleading the kernel to pr_warn() invalid data and hang.

    [...] partition (null) (11 pp's found) is not contiguous
    [...] partition (null) (2 pp's found) is not contiguous
    [...] partition (null) (3 pp's found) is not contiguous
    [...] partition (null) (64 pp's found) is not contiguous

Fixes: 6ceea22bbb ("partitions: add aix lvm partition support files")
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27 09:17:39 -06:00
Greg Edwards 5151842b9d block: reset bi_iter.bi_done after splitting bio
After the bio has been updated to represent the remaining sectors, reset
bi_done so bio_rewind_iter() does not rewind further than it should.

This resolves a bio_integrity_process() failure on reads where the
original request was split.

Fixes: 63573e359d ("bio-integrity: Restore original iterator on verify stage")
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27 09:10:34 -06:00
Greg Edwards 359f642700 block: move bio_integrity_{intervals,bytes} into blkdev.h
This allows bio_integrity_bytes() to be called from drivers instead of
open coding it.

Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-26 15:49:41 -06:00
Martin Wilck 17d51b10d7 block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
bio_iov_iter_get_pages() currently only adds pages for the next non-zero
segment from the iov_iter to the bio. That's suboptimal for callers,
which typically try to pin as many pages as fit into the bio. This patch
converts the current bio_iov_iter_get_pages() into a static helper, and
introduces a new helper that allocates as many pages as

 1) fit into the bio,
 2) are present in the iov_iter,
 3) and can be pinned by MM.

Error is returned only if zero pages could be pinned. Because of 3), a
zero return value doesn't necessarily mean all pages have been pinned.
Callers that have to pin every page in the iov_iter must still call this
function in a loop (this is currently the case).

This change matters most for __blkdev_direct_IO_simple(), which calls
bio_iov_iter_get_pages() only once. If it obtains less pages than
requested, it returns a "short write" or "short read", and
__generic_file_write_iter() falls back to buffered writes, which may
lead to data corruption.

Fixes: 72ecad22d9 ("block: support a full bio worth of IO for simplified bdev direct-io")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-26 11:52:36 -06:00
Martin Wilck b403ea2404 block: bio_iov_iter_get_pages: fix size of last iovec
If the last page of the bio is not "full", the length of the last
vector slot needs to be corrected. This slot has the index
(bio->bi_vcnt - 1), but only in bio->bi_io_vec. In the "bv" helper
array, which is shifted by the value of bio->bi_vcnt at function
invocation, the correct index is (nr_pages - 1).

v2: improved readability following suggestions from Ming Lei.
v3: followed a formatting suggestion from Christoph Hellwig.

Fixes: 2cefe4dbaa ("block: add bio_iov_iter_get_pages()")
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-26 11:52:29 -06:00