Merge branch 'for-4.15/block' of git://git.kernel.dk/linux-block

Pull core block layer updates from Jens Axboe:
 "This is the main pull request for block storage for 4.15-rc1.

  Nothing out of the ordinary in here, and no API changes or anything
  like that. Just various new features for drivers, core changes, etc.
  In particular, this pull request contains:

   - A patch series from Bart, closing the hole on blk/scsi-mq queue
     quiescing.

   - A series from Christoph, building towards hidden gendisks (for
     multipath) and ability to move bio chains around.

   - NVMe
        - Support for native multipath for NVMe (Christoph).
        - Userspace notifications for AENs (Keith).
        - Command side-effects support (Keith).
        - SGL support (Chaitanya Kulkarni)
        - FC fixes and improvements (James Smart)
        - Lots of fixes and tweaks (Various)

   - bcache
        - New maintainer (Michael Lyle)
        - Writeback control improvements (Michael)
        - Various fixes (Coly, Elena, Eric, Liang, et al)

   - lightnvm updates, mostly centered around the pblk interface
     (Javier, Hans, and Rakesh).

   - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

   - Writeback series that fix the much discussed hundreds of millions
     of sync-all units. This goes all the way, as discussed previously
     (me).

   - Fix for missing wakeup on writeback timer adjustments (Yafang
     Shao).

   - Fix laptop mode on blk-mq (me).

   - {mq,name} tuple lookup for IO schedulers, allowing us to have
     alias names. This means you can use 'deadline' on both !mq and on
     mq (where it's called mq-deadline). (me).

   - blktrace race fix, oopsing on sg load (me).

   - blk-mq optimizations (me).

   - Obscure waitqueue race fix for kyber (Omar).

   - NBD fixes (Josef).

   - Disable writeback throttling by default on bfq, like we do on cfq
     (Luca Miccio).

   - Series from Ming that enable us to treat flush requests on blk-mq
     like any other request. This is a really nice cleanup.

   - Series from Ming that improves merging on blk-mq with schedulers,
     getting us closer to flipping the switch on scsi-mq again.

   - BFQ updates (Paolo).

   - blk-mq atomic flags memory ordering fixes (Peter Z).

   - Loop cgroup support (Shaohua).

   - Lots of minor fixes from lots of different folks, both for core and
     driver code"
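
The {mq,name} lookup mentioned in the list above can be illustrated with a small userspace sketch. It is not the kernel's elevator code: `struct sched_type`, `sched_table` and `sched_find()` are hypothetical names invented for this example, and the table contents are only illustrative. The point is that a scheduler matches when its name or alias equals the requested string and its mq/legacy kind matches the queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for an I/O scheduler descriptor. */
struct sched_type {
	const char *name;	/* canonical name, e.g. "mq-deadline" */
	const char *alias;	/* optional alias, e.g. "deadline"; may be NULL */
	bool uses_mq;		/* true if the scheduler serves blk-mq queues */
};

/* Illustrative table only; not the kernel's registration list. */
static const struct sched_type sched_table[] = {
	{ "deadline",    NULL,       false },
	{ "cfq",         NULL,       false },
	{ "mq-deadline", "deadline", true  },
	{ "kyber",       NULL,       true  },
};

/*
 * Match on the {mq, name} tuple: the requested string must equal the
 * name or the alias, and the scheduler kind must match the queue kind.
 */
static const struct sched_type *sched_find(const char *name, bool mq)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < sizeof(sched_table) / sizeof(sched_table[0]); i++) {
		const struct sched_type *s = &sched_table[i];
		bool name_ok = strcmp(s->name, name) == 0 ||
			       (s->alias && strcmp(s->alias, name) == 0);

		if (name_ok && s->uses_mq == mq)
			return s;
	}
	return NULL;
}
```

With this table, asking for "deadline" resolves to a different entry depending on whether the queue is mq or not, which is the behaviour the pull request describes.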

* 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
  nvme: fix visibility of "uuid" ns attribute
  blk-mq: fixup some comment typos and lengths
  ide: ide-atapi: fix compile error with defining macro DEBUG
  blk-mq: improve tag waiting setup for non-shared tags
  brd: remove unused brd_mutex
  blk-mq: only run the hardware queue if IO is pending
  block: avoid null pointer dereference on null disk
  fs: guard_bio_eod() needs to consider partitions
  xtensa/simdisk: fix compile error
  nvme: expose subsys attribute to sysfs
  nvme: create 'slaves' and 'holders' entries for hidden controllers
  block: create 'slaves' and 'holders' entries for hidden gendisks
  nvme: also expose the namespace identification sysfs files for mpath nodes
  nvme: implement multipath access to nvme subsystems
  nvme: track shared namespaces
  nvme: introduce a nvme_ns_ids structure
  nvme: track subsystems
  block, nvme: Introduce blk_mq_req_flags_t
  block, scsi: Make SCSI quiesce and resume work reliably
  block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
  ...
Linus Torvalds 2017-11-14 15:32:19 -08:00
commit e2c5923c34
131 changed files with 5485 additions and 3104 deletions


@ -1,5 +0,0 @@
What: /proc/sys/vm/nr_pdflush_threads
Date: June 2012
Contact: Wanpeng Li <liwp@linux.vnet.ibm.com>
Description: Since pdflush is replaced by per-BDI flusher, the interface of old pdflush
exported in /proc/sys/vm/ should be removed.


@ -216,10 +216,9 @@ may need to abort DMA operations and revert to PIO for the transfer, in
which case a virtual mapping of the page is required. For SCSI it is also
done in some scenarios where the low level driver cannot be trusted to
handle a single sg entry correctly. The driver is expected to perform the
kmaps as needed on such occasions using the __bio_kmap_atomic and bio_kmap_irq
routines as appropriate. A driver could also use the blk_queue_bounce()
routine on its own to bounce highmem i/o to low memory for specific requests
if so desired.
kmaps as needed on such occasions as appropriate. A driver could also use
the blk_queue_bounce() routine on its own to bounce highmem i/o to low
memory for specific requests if so desired.
iii. The i/o scheduler algorithm itself can be replaced/set as appropriate
@ -1137,8 +1136,8 @@ use dma_map_sg for scatter gather) to be able to ship it to the driver. For
PIO drivers (or drivers that need to revert to PIO transfer once in a
while (IDE for example)), where the CPU is doing the actual data
transfer a virtual mapping is needed. If the driver supports highmem I/O,
(Sec 1.1, (ii) ) it needs to use __bio_kmap_atomic and bio_kmap_irq to
temporarily map a bio into the virtual address space.
(Sec 1.1, (ii) ) it needs to use kmap_atomic or similar to temporarily map
a bio into the virtual address space.
8. Prior/Related/Impacted patches


@ -38,7 +38,7 @@ gb=[Size in GB]: Default: 250GB
bs=[Block size (in bytes)]: Default: 512 bytes
The block size reported to the system.
nr_devices=[Number of devices]: Default: 2
nr_devices=[Number of devices]: Default: 1
Number of block devices instantiated. They are instantiated as /dev/nullb0,
etc.
@ -52,13 +52,13 @@ irqmode=[0-2]: Default: 1-Soft-irq
2: Timer: Waits a specific period (completion_nsec) for each IO before
completion.
completion_nsec=[ns]: Default: 10.000ns
completion_nsec=[ns]: Default: 10,000ns
Combined with irqmode=2 (timer). The time each completion event must wait.
submit_queues=[0..nr_cpus]:
submit_queues=[1..nr_cpus]:
The number of submission queues attached to the device driver. If unset, it
defaults to 1 on single-queue and bio-based instances. For multi-queue,
it is ignored when use_per_node_hctx module parameter is 1.
defaults to 1. For multi-queue, it is ignored when use_per_node_hctx module
parameter is 1.
hw_queue_depth=[0..qdepth]: Default: 64
The hardware queue depth of the device.
@ -73,3 +73,12 @@ use_per_node_hctx=[0/1]: Default: 0
use_lightnvm=[0/1]: Default: 0
Register device with LightNVM. Requires blk-mq and CONFIG_NVM to be enabled.
no_sched=[0/1]: Default: 0
0: nullb* use default blk-mq io scheduler.
1: nullb* doesn't use io scheduler.
shared_tags=[0/1]: Default: 0
0: Tag set is not shared.
1: Tag set shared between devices for blk-mq. Only makes sense with
nr_devices > 1, otherwise there's no tag set to share.


@ -2562,10 +2562,12 @@ S: Maintained
F: drivers/net/hamradio/baycom*
BCACHE (BLOCK LAYER CACHE)
M: Michael Lyle <mlyle@lyle.org>
M: Kent Overstreet <kent.overstreet@gmail.com>
L: linux-bcache@vger.kernel.org
W: http://bcache.evilpiepirate.org
S: Orphan
C: irc://irc.oftc.net/bcache
S: Maintained
F: drivers/md/bcache/
BDISP ST MEDIA DRIVER
@ -12085,7 +12087,6 @@ F: drivers/mmc/host/sdhci-omap.c
SECURE ENCRYPTING DEVICE (SED) OPAL DRIVER
M: Scott Bauer <scott.bauer@intel.com>
M: Jonathan Derrick <jonathan.derrick@intel.com>
M: Rafael Antognolli <rafael.antognolli@intel.com>
L: linux-block@vger.kernel.org
S: Supported
F: block/sed*


@ -110,13 +110,13 @@ static blk_qc_t simdisk_make_request(struct request_queue *q, struct bio *bio)
sector_t sector = bio->bi_iter.bi_sector;
bio_for_each_segment(bvec, bio, iter) {
char *buffer = __bio_kmap_atomic(bio, iter);
char *buffer = kmap_atomic(bvec.bv_page) + bvec.bv_offset;
unsigned len = bvec.bv_len >> SECTOR_SHIFT;
simdisk_transfer(dev, sector, len, buffer,
bio_data_dir(bio) == WRITE);
sector += len;
__bio_kunmap_atomic(buffer);
kunmap_atomic(buffer);
}
bio_endio(bio);


@ -108,6 +108,7 @@
#include "blk-mq-tag.h"
#include "blk-mq-sched.h"
#include "bfq-iosched.h"
#include "blk-wbt.h"
#define BFQ_BFQQ_FNS(name) \
void bfq_mark_bfqq_##name(struct bfq_queue *bfqq) \
@ -724,6 +725,44 @@ static void bfq_updated_next_req(struct bfq_data *bfqd,
}
}
static unsigned int bfq_wr_duration(struct bfq_data *bfqd)
{
u64 dur;
if (bfqd->bfq_wr_max_time > 0)
return bfqd->bfq_wr_max_time;
dur = bfqd->RT_prod;
do_div(dur, bfqd->peak_rate);
/*
* Limit duration between 3 and 13 seconds. Tests show that
* higher values than 13 seconds often yield the opposite of
* the desired result, i.e., worsen responsiveness by letting
* non-interactive and non-soft-real-time applications
* preserve weight raising for a too long time interval.
*
* On the other end, lower values than 3 seconds make it
* difficult for most interactive tasks to complete their jobs
* before weight-raising finishes.
*/
if (dur > msecs_to_jiffies(13000))
dur = msecs_to_jiffies(13000);
else if (dur < msecs_to_jiffies(3000))
dur = msecs_to_jiffies(3000);
return dur;
}
/* switch back from soft real-time to interactive weight raising */
static void switch_back_to_interactive_wr(struct bfq_queue *bfqq,
struct bfq_data *bfqd)
{
bfqq->wr_coeff = bfqd->bfq_wr_coeff;
bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
bfqq->last_wr_start_finish = bfqq->wr_start_at_switch_to_srt;
}
static void
bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_data *bfqd,
struct bfq_io_cq *bic, bool bfq_already_existing)
@ -750,10 +789,16 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_data *bfqd,
if (bfqq->wr_coeff > 1 && (bfq_bfqq_in_large_burst(bfqq) ||
time_is_before_jiffies(bfqq->last_wr_start_finish +
bfqq->wr_cur_max_time))) {
bfq_log_bfqq(bfqq->bfqd, bfqq,
"resume state: switching off wr");
bfqq->wr_coeff = 1;
if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time &&
!bfq_bfqq_in_large_burst(bfqq) &&
time_is_after_eq_jiffies(bfqq->wr_start_at_switch_to_srt +
bfq_wr_duration(bfqd))) {
switch_back_to_interactive_wr(bfqq, bfqd);
} else {
bfqq->wr_coeff = 1;
bfq_log_bfqq(bfqq->bfqd, bfqq,
"resume state: switching off wr");
}
}
/* make sure weight will be updated, however we got here */
@ -1173,33 +1218,22 @@ static bool bfq_bfqq_update_budg_for_activation(struct bfq_data *bfqd,
return wr_or_deserves_wr;
}
static unsigned int bfq_wr_duration(struct bfq_data *bfqd)
/*
* Return the farthest future time instant according to jiffies
* macros.
*/
static unsigned long bfq_greatest_from_now(void)
{
u64 dur;
return jiffies + MAX_JIFFY_OFFSET;
}
if (bfqd->bfq_wr_max_time > 0)
return bfqd->bfq_wr_max_time;
dur = bfqd->RT_prod;
do_div(dur, bfqd->peak_rate);
/*
* Limit duration between 3 and 13 seconds. Tests show that
* higher values than 13 seconds often yield the opposite of
* the desired result, i.e., worsen responsiveness by letting
* non-interactive and non-soft-real-time applications
* preserve weight raising for a too long time interval.
*
* On the other end, lower values than 3 seconds make it
* difficult for most interactive tasks to complete their jobs
* before weight-raising finishes.
*/
if (dur > msecs_to_jiffies(13000))
dur = msecs_to_jiffies(13000);
else if (dur < msecs_to_jiffies(3000))
dur = msecs_to_jiffies(3000);
return dur;
/*
* Return the farthest past time instant according to jiffies
* macros.
*/
static unsigned long bfq_smallest_from_now(void)
{
return jiffies - MAX_JIFFY_OFFSET;
}
static void bfq_update_bfqq_wr_on_rq_arrival(struct bfq_data *bfqd,
@ -1216,7 +1250,19 @@ static void bfq_update_bfqq_wr_on_rq_arrival(struct bfq_data *bfqd,
bfqq->wr_coeff = bfqd->bfq_wr_coeff;
bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
} else {
bfqq->wr_start_at_switch_to_srt = jiffies;
/*
* No interactive weight raising in progress
* here: assign minus infinity to
* wr_start_at_switch_to_srt, to make sure
* that, at the end of the soft-real-time
* weight raising period that is starting
* now, no interactive weight-raising period
* may be wrongly considered as still in
* progress (and thus actually started by
* mistake).
*/
bfqq->wr_start_at_switch_to_srt =
bfq_smallest_from_now();
bfqq->wr_coeff = bfqd->bfq_wr_coeff *
BFQ_SOFTRT_WEIGHT_FACTOR;
bfqq->wr_cur_max_time =
@ -2016,10 +2062,27 @@ static void bfq_bfqq_save_state(struct bfq_queue *bfqq)
bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);
bic->saved_in_large_burst = bfq_bfqq_in_large_burst(bfqq);
bic->was_in_burst_list = !hlist_unhashed(&bfqq->burst_list_node);
bic->saved_wr_coeff = bfqq->wr_coeff;
bic->saved_wr_start_at_switch_to_srt = bfqq->wr_start_at_switch_to_srt;
bic->saved_last_wr_start_finish = bfqq->last_wr_start_finish;
bic->saved_wr_cur_max_time = bfqq->wr_cur_max_time;
if (unlikely(bfq_bfqq_just_created(bfqq) &&
!bfq_bfqq_in_large_burst(bfqq))) {
/*
* bfqq being merged right after being created: bfqq
* would have deserved interactive weight raising, but
* did not make it to be set in a weight-raised state,
* because of this early merge. Store directly the
* weight-raising state that would have been assigned
* to bfqq, so that bfqq does not unjustly fail to enjoy
* weight raising if split soon.
*/
bic->saved_wr_coeff = bfqq->bfqd->bfq_wr_coeff;
bic->saved_wr_cur_max_time = bfq_wr_duration(bfqq->bfqd);
bic->saved_last_wr_start_finish = jiffies;
} else {
bic->saved_wr_coeff = bfqq->wr_coeff;
bic->saved_wr_start_at_switch_to_srt =
bfqq->wr_start_at_switch_to_srt;
bic->saved_last_wr_start_finish = bfqq->last_wr_start_finish;
bic->saved_wr_cur_max_time = bfqq->wr_cur_max_time;
}
}
static void
@ -2897,24 +2960,6 @@ static unsigned long bfq_bfqq_softrt_next_start(struct bfq_data *bfqd,
jiffies + nsecs_to_jiffies(bfqq->bfqd->bfq_slice_idle) + 4);
}
/*
* Return the farthest future time instant according to jiffies
* macros.
*/
static unsigned long bfq_greatest_from_now(void)
{
return jiffies + MAX_JIFFY_OFFSET;
}
/*
* Return the farthest past time instant according to jiffies
* macros.
*/
static unsigned long bfq_smallest_from_now(void)
{
return jiffies - MAX_JIFFY_OFFSET;
}
/**
* bfq_bfqq_expire - expire a queue.
* @bfqd: device owning the queue.
@ -3489,11 +3534,7 @@ static void bfq_update_wr_data(struct bfq_data *bfqd, struct bfq_queue *bfqq)
bfq_wr_duration(bfqd)))
bfq_bfqq_end_wr(bfqq);
else {
/* switch back to interactive wr */
bfqq->wr_coeff = bfqd->bfq_wr_coeff;
bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
bfqq->last_wr_start_finish =
bfqq->wr_start_at_switch_to_srt;
switch_back_to_interactive_wr(bfqq, bfqd);
bfqq->entity.prio_changed = 1;
}
}
@ -3685,16 +3726,37 @@ void bfq_put_queue(struct bfq_queue *bfqq)
if (bfqq->ref)
return;
if (bfq_bfqq_sync(bfqq))
/*
* The fact that this queue is being destroyed does not
* invalidate the fact that this queue may have been
* activated during the current burst. As a consequence,
* although the queue does not exist anymore, and hence
* needs to be removed from the burst list if there,
* the burst size has not to be decremented.
*/
if (!hlist_unhashed(&bfqq->burst_list_node)) {
hlist_del_init(&bfqq->burst_list_node);
/*
* Decrement also burst size after the removal, if the
* process associated with bfqq is exiting, and thus
* does not contribute to the burst any longer. This
* decrement helps filter out false positives of large
* bursts, when some short-lived process (often due to
* the execution of commands by some service) happens
* to start and exit while a complex application is
* starting, and thus spawning several processes that
* do I/O (and that *must not* be treated as a large
* burst, see comments on bfq_handle_burst).
*
* In particular, the decrement is performed only if:
* 1) bfqq is not a merged queue, because, if it is,
* then this free of bfqq is not triggered by the exit
* of the process bfqq is associated with, but exactly
* by the fact that bfqq has just been merged.
* 2) burst_size is greater than 0, to handle
* unbalanced decrements. Unbalanced decrements may
* happen in the following case: bfqq is inserted into
* the current burst list--without incrementing
* burst_size--because of a split, but the current
* burst list is not the burst list bfqq belonged to
* (see comments on the case of a split in
* bfq_set_request).
*/
if (bfqq->bic && bfqq->bfqd->burst_size > 0)
bfqq->bfqd->burst_size--;
}
kmem_cache_free(bfq_pool, bfqq);
#ifdef CONFIG_BFQ_GROUP_IOSCHED
@ -4127,7 +4189,6 @@ static void __bfq_insert_request(struct bfq_data *bfqd, struct request *rq)
new_bfqq->allocated++;
bfqq->allocated--;
new_bfqq->ref++;
bfq_clear_bfqq_just_created(bfqq);
/*
* If the bic associated with the process
* issuing this request still points to bfqq
@ -4139,6 +4200,8 @@ static void __bfq_insert_request(struct bfq_data *bfqd, struct request *rq)
if (bic_to_bfqq(RQ_BIC(rq), 1) == bfqq)
bfq_merge_bfqqs(bfqd, RQ_BIC(rq),
bfqq, new_bfqq);
bfq_clear_bfqq_just_created(bfqq);
/*
* rq is about to be enqueued into new_bfqq,
* release rq reference on bfqq
@ -4424,6 +4487,34 @@ static struct bfq_queue *bfq_get_bfqq_handle_split(struct bfq_data *bfqd,
else {
bfq_clear_bfqq_in_large_burst(bfqq);
if (bic->was_in_burst_list)
/*
* If bfqq was in the current
* burst list before being
* merged, then we have to add
* it back. And we do not need
* to increase burst_size, as
* we did not decrement
* burst_size when we removed
* bfqq from the burst list as
* a consequence of a merge
* (see comments in
* bfq_put_queue). In this
* respect, it would be rather
* costly to know whether the
* current burst list is still
* the same burst list from
* which bfqq was removed on
* the merge. To avoid this
* cost, if bfqq was in a
* burst list, then we add
* bfqq to the current burst
* list without any further
* check. This can cause
* inappropriate insertions,
* but rarely enough to not
* harm the detection of large
* bursts significantly.
*/
hlist_add_head(&bfqq->burst_list_node,
&bfqd->burst_list);
}
@ -4775,7 +4866,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
bfq_init_root_group(bfqd->root_group, bfqd);
bfq_init_entity(&bfqd->oom_bfqq.entity, bfqd->root_group);
wbt_disable_default(q);
return 0;
out_free:
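
The clamping done by bfq_wr_duration() in the hunks above can be sketched in userspace as follows. This is a simplified illustration, not the BFQ code: it works in milliseconds rather than jiffies, and `wr_duration_ms()` with its three parameters is a hypothetical helper standing in for the fields `RT_prod`, `peak_rate` and `bfq_wr_max_time`.

```c
#include <assert.h>
#include <stdint.h>

#define WR_MIN_MS  3000u	/* lower bound from the comment: 3 seconds */
#define WR_MAX_MS 13000u	/* upper bound from the comment: 13 seconds */

/*
 * Hypothetical userspace analogue of the duration computation:
 * a non-zero forced value wins; otherwise rt_prod / peak_rate is
 * clamped to the [3 s, 13 s] window.
 */
static uint64_t wr_duration_ms(uint64_t rt_prod, uint64_t peak_rate,
			       uint64_t forced_ms)
{
	uint64_t dur;

	if (forced_ms > 0)
		return forced_ms;

	assert(peak_rate != 0);	/* assumption: caller has a measured rate */
	dur = rt_prod / peak_rate;

	if (dur > WR_MAX_MS)
		dur = WR_MAX_MS;
	else if (dur < WR_MIN_MS)
		dur = WR_MIN_MS;
	return dur;
}
```

A value inside the window is returned unchanged, while values outside it are pulled to the nearest bound.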


@ -485,11 +485,8 @@ EXPORT_SYMBOL(bioset_integrity_create);
void bioset_integrity_free(struct bio_set *bs)
{
if (bs->bio_integrity_pool)
mempool_destroy(bs->bio_integrity_pool);
if (bs->bvec_integrity_pool)
mempool_destroy(bs->bvec_integrity_pool);
mempool_destroy(bs->bio_integrity_pool);
mempool_destroy(bs->bvec_integrity_pool);
}
EXPORT_SYMBOL(bioset_integrity_free);
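
The cleanup above relies on mempool_destroy() tolerating a NULL pool, so callers do not need their own checks. A minimal userspace sketch of the same pattern follows; `struct pool`, `pool_create()`, `pool_destroy()` and `pool_set_free()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical pool type; stands in for a mempool in this sketch. */
struct pool {
	void *storage;
};

static struct pool *pool_create(size_t bytes)
{
	struct pool *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->storage = malloc(bytes);
	if (!p->storage) {
		free(p);
		return NULL;
	}
	return p;
}

/* NULL-safe destructor: the check lives here, once, not in every caller. */
static void pool_destroy(struct pool *p)
{
	if (!p)
		return;
	free(p->storage);
	free(p);
}

struct pool_set {
	struct pool *a;
	struct pool *b;	/* may legitimately be NULL */
};

/* Callers can destroy unconditionally, as in the hunk above. */
static void pool_set_free(struct pool_set *ps)
{
	assert(ps != NULL);
	pool_destroy(ps->a);
	pool_destroy(ps->b);
	ps->a = NULL;
	ps->b = NULL;
}
```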


@ -400,7 +400,7 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
/**
* bio_alloc_bioset - allocate a bio for I/O
* @gfp_mask: the GFP_ mask given to the slab allocator
* @gfp_mask: the GFP_* mask given to the slab allocator
* @nr_iovecs: number of iovecs to pre-allocate
* @bs: the bio_set to allocate from.
*
@ -1931,11 +1931,8 @@ void bioset_free(struct bio_set *bs)
if (bs->rescue_workqueue)
destroy_workqueue(bs->rescue_workqueue);
if (bs->bio_pool)
mempool_destroy(bs->bio_pool);
if (bs->bvec_pool)
mempool_destroy(bs->bvec_pool);
mempool_destroy(bs->bio_pool);
mempool_destroy(bs->bvec_pool);
bioset_integrity_free(bs);
bio_put_slab(bs);
@ -2035,37 +2032,6 @@ int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css)
}
EXPORT_SYMBOL_GPL(bio_associate_blkcg);
/**
* bio_associate_current - associate a bio with %current
* @bio: target bio
*
* Associate @bio with %current if it hasn't been associated yet. Block
* layer will treat @bio as if it were issued by %current no matter which
* task actually issues it.
*
* This function takes an extra reference of @task's io_context and blkcg
* which will be put when @bio is released. The caller must own @bio,
* ensure %current->io_context exists, and is responsible for synchronizing
* calls to this function.
*/
int bio_associate_current(struct bio *bio)
{
struct io_context *ioc;
if (bio->bi_css)
return -EBUSY;
ioc = current->io_context;
if (!ioc)
return -ENOENT;
get_io_context_active(ioc);
bio->bi_ioc = ioc;
bio->bi_css = task_get_css(current, io_cgrp_id);
return 0;
}
EXPORT_SYMBOL_GPL(bio_associate_current);
/**
* bio_disassociate_task - undo bio_associate_current()
* @bio: target bio


@ -1419,6 +1419,11 @@ int blkcg_policy_register(struct blkcg_policy *pol)
if (i >= BLKCG_MAX_POLS)
goto err_unlock;
/* Make sure cpd/pd_alloc_fn and cpd/pd_free_fn in pairs */
if ((!pol->cpd_alloc_fn ^ !pol->cpd_free_fn) ||
(!pol->pd_alloc_fn ^ !pol->pd_free_fn))
goto err_unlock;
/* register @pol */
pol->plid = i;
blkcg_policy[pol->plid] = pol;
@ -1452,7 +1457,7 @@ int blkcg_policy_register(struct blkcg_policy *pol)
return 0;
err_free_cpds:
if (pol->cpd_alloc_fn) {
if (pol->cpd_free_fn) {
list_for_each_entry(blkcg, &all_blkcgs, all_blkcgs_node) {
if (blkcg->cpd[pol->plid]) {
pol->cpd_free_fn(blkcg->cpd[pol->plid]);
@ -1492,7 +1497,7 @@ void blkcg_policy_unregister(struct blkcg_policy *pol)
/* remove cpds and unregister */
mutex_lock(&blkcg_pol_mutex);
if (pol->cpd_alloc_fn) {
if (pol->cpd_free_fn) {
list_for_each_entry(blkcg, &all_blkcgs, all_blkcgs_node) {
if (blkcg->cpd[pol->plid]) {
pol->cpd_free_fn(blkcg->cpd[pol->plid]);
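
The pairing check added to blkcg_policy_register() uses `!a ^ !b`, which is true exactly when one callback is set and the other is not. A small userspace sketch of that idiom is shown below; `struct policy`, the callback typedefs and the dummy callbacks are hypothetical and only mirror the shape of the kernel structure.

```c
#include <stdbool.h>
#include <stddef.h>

typedef void *(*alloc_fn)(void);
typedef void (*free_fn)(void *);

/* Hypothetical, reduced policy descriptor with two callback pairs. */
struct policy {
	alloc_fn cpd_alloc;
	free_fn cpd_free;
	alloc_fn pd_alloc;
	free_fn pd_free;
};

/* Dummy callbacks so the example has something to register. */
static void *dummy_alloc(void)
{
	return NULL;
}

static void dummy_free(void *p)
{
	(void)p;
}

/*
 * Each alloc/free pair must be either fully present or fully absent.
 * "!x" normalises a pointer to 0 or 1, so XOR is 1 only on a mismatch.
 */
static bool policy_callbacks_paired(const struct policy *pol)
{
	if ((!pol->cpd_alloc ^ !pol->cpd_free) ||
	    (!pol->pd_alloc ^ !pol->pd_free))
		return false;
	return true;
}
```

Checking the free callback (rather than the alloc callback) before calling it, as the later hunks do, is then safe either way, because a registered policy always has both or neither.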


@ -333,11 +333,13 @@ EXPORT_SYMBOL(blk_stop_queue);
void blk_sync_queue(struct request_queue *q)
{
del_timer_sync(&q->timeout);
cancel_work_sync(&q->timeout_work);
if (q->mq_ops) {
struct blk_mq_hw_ctx *hctx;
int i;
cancel_delayed_work_sync(&q->requeue_work);
queue_for_each_hw_ctx(q, hctx, i)
cancel_delayed_work_sync(&hctx->run_work);
} else {
@ -346,6 +348,37 @@ void blk_sync_queue(struct request_queue *q)
}
EXPORT_SYMBOL(blk_sync_queue);
/**
* blk_set_preempt_only - set QUEUE_FLAG_PREEMPT_ONLY
* @q: request queue pointer
*
* Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not
* set and 1 if the flag was already set.
*/
int blk_set_preempt_only(struct request_queue *q)
{
unsigned long flags;
int res;
spin_lock_irqsave(q->queue_lock, flags);
res = queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q);
spin_unlock_irqrestore(q->queue_lock, flags);
return res;
}
EXPORT_SYMBOL_GPL(blk_set_preempt_only);
void blk_clear_preempt_only(struct request_queue *q)
{
unsigned long flags;
spin_lock_irqsave(q->queue_lock, flags);
queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
wake_up_all(&q->mq_freeze_wq);
spin_unlock_irqrestore(q->queue_lock, flags);
}
EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
/**
* __blk_run_queue_uncond - run a queue whether or not it has been stopped
* @q: The queue to run
@ -610,6 +643,9 @@ void blk_set_queue_dying(struct request_queue *q)
}
spin_unlock_irq(q->queue_lock);
}
/* Make blk_queue_enter() reexamine the DYING flag. */
wake_up_all(&q->mq_freeze_wq);
}
EXPORT_SYMBOL_GPL(blk_set_queue_dying);
@ -718,7 +754,7 @@ static void free_request_size(void *element, void *data)
int blk_init_rl(struct request_list *rl, struct request_queue *q,
gfp_t gfp_mask)
{
if (unlikely(rl->rq_pool))
if (unlikely(rl->rq_pool) || q->mq_ops)
return 0;
rl->q = q;
@ -760,15 +796,38 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
}
EXPORT_SYMBOL(blk_alloc_queue);
int blk_queue_enter(struct request_queue *q, bool nowait)
/**
* blk_queue_enter() - try to increase q->q_usage_counter
* @q: request queue pointer
* @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
*/
int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
{
const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
while (true) {
bool success = false;
int ret;
if (percpu_ref_tryget_live(&q->q_usage_counter))
rcu_read_lock_sched();
if (percpu_ref_tryget_live(&q->q_usage_counter)) {
/*
* The code that sets the PREEMPT_ONLY flag is
* responsible for ensuring that that flag is globally
* visible before the queue is unfrozen.
*/
if (preempt || !blk_queue_preempt_only(q)) {
success = true;
} else {
percpu_ref_put(&q->q_usage_counter);
}
}
rcu_read_unlock_sched();
if (success)
return 0;
if (nowait)
if (flags & BLK_MQ_REQ_NOWAIT)
return -EBUSY;
/*
@ -781,7 +840,8 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
smp_rmb();
ret = wait_event_interruptible(q->mq_freeze_wq,
!atomic_read(&q->mq_freeze_depth) ||
(atomic_read(&q->mq_freeze_depth) == 0 &&
(preempt || !blk_queue_preempt_only(q))) ||
blk_queue_dying(q));
if (blk_queue_dying(q))
return -ENODEV;
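
The admission decision made by the reworked blk_queue_enter() can be summarised with a single-pass userspace sketch. It is a deliberate simplification: there is no percpu reference, no RCU, no wait queue and no handling of a dying queue, and every name in it (`struct sketch_queue`, `queue_try_enter()`, the `SKETCH_REQ_*` bits, `enum enter_result`) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bits standing in for BLK_MQ_REQ_NOWAIT / BLK_MQ_REQ_PREEMPT. */
#define SKETCH_REQ_NOWAIT  (1u << 0)
#define SKETCH_REQ_PREEMPT (1u << 1)

/* Hypothetical, reduced queue state. */
struct sketch_queue {
	bool frozen;		/* freeze depth is non-zero */
	bool preempt_only;	/* PREEMPT_ONLY flag is set */
};

enum enter_result {
	ENTER_OK,		/* caller may proceed */
	ENTER_BUSY,		/* NOWAIT caller is turned away */
	ENTER_WOULD_WAIT,	/* real code would sleep and retry */
};

static enum enter_result queue_try_enter(const struct sketch_queue *q,
					 unsigned int flags)
{
	const bool preempt = flags & SKETCH_REQ_PREEMPT;

	assert(q != NULL);

	/* Preempt requests pass a preempt-only queue; others do not. */
	if (!q->frozen && (preempt || !q->preempt_only))
		return ENTER_OK;

	if (flags & SKETCH_REQ_NOWAIT)
		return ENTER_BUSY;

	return ENTER_WOULD_WAIT;
}
```

In the kernel the "would wait" case sleeps on mq_freeze_wq, which is why blk_clear_preempt_only() and blk_set_queue_dying() both wake that wait queue.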
@ -844,6 +904,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
setup_timer(&q->backing_dev_info->laptop_mode_wb_timer,
laptop_mode_timer_fn, (unsigned long) q);
setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q);
INIT_WORK(&q->timeout_work, NULL);
INIT_LIST_HEAD(&q->queue_head);
INIT_LIST_HEAD(&q->timeout_list);
INIT_LIST_HEAD(&q->icq_list);
@ -1154,7 +1215,7 @@ int blk_update_nr_requests(struct request_queue *q, unsigned int nr)
* @rl: request list to allocate from
* @op: operation and flags
* @bio: bio to allocate request for (can be %NULL)
* @gfp_mask: allocation mask
* @flags: BLK_MQ_REQ_* flags
*
* Get a free request from @q. This function may fail under memory
* pressure or if @q is dead.
@ -1164,7 +1225,7 @@ int blk_update_nr_requests(struct request_queue *q, unsigned int nr)
* Returns request pointer on success, with @q->queue_lock *not held*.
*/
static struct request *__get_request(struct request_list *rl, unsigned int op,
struct bio *bio, gfp_t gfp_mask)
struct bio *bio, blk_mq_req_flags_t flags)
{
struct request_queue *q = rl->q;
struct request *rq;
@ -1173,6 +1234,8 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
struct io_cq *icq = NULL;
const bool is_sync = op_is_sync(op);
int may_queue;
gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? GFP_ATOMIC :
__GFP_DIRECT_RECLAIM;
req_flags_t rq_flags = RQF_ALLOCED;
lockdep_assert_held(q->queue_lock);
@ -1255,6 +1318,8 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
blk_rq_set_rl(rq, rl);
rq->cmd_flags = op;
rq->rq_flags = rq_flags;
if (flags & BLK_MQ_REQ_PREEMPT)
rq->rq_flags |= RQF_PREEMPT;
/* init elvpriv */
if (rq_flags & RQF_ELVPRIV) {
@ -1333,7 +1398,7 @@ rq_starved:
* @q: request_queue to allocate request from
* @op: operation and flags
* @bio: bio to allocate request for (can be %NULL)
* @gfp_mask: allocation mask
* @flags: BLK_MQ_REQ_* flags.
*
* Get a free request from @q. If %__GFP_DIRECT_RECLAIM is set in @gfp_mask,
* this function keeps retrying under memory pressure and fails iff @q is dead.
@ -1343,7 +1408,7 @@ rq_starved:
* Returns request pointer on success, with @q->queue_lock *not held*.
*/
static struct request *get_request(struct request_queue *q, unsigned int op,
struct bio *bio, gfp_t gfp_mask)
struct bio *bio, blk_mq_req_flags_t flags)
{
const bool is_sync = op_is_sync(op);
DEFINE_WAIT(wait);
@ -1355,7 +1420,7 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
rl = blk_get_rl(q, bio); /* transferred to @rq on success */
retry:
rq = __get_request(rl, op, bio, gfp_mask);
rq = __get_request(rl, op, bio, flags);
if (!IS_ERR(rq))
return rq;
@ -1364,7 +1429,7 @@ retry:
return ERR_PTR(-EAGAIN);
}
if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
if ((flags & BLK_MQ_REQ_NOWAIT) || unlikely(blk_queue_dying(q))) {
blk_put_rl(rl);
return rq;
}
@ -1391,20 +1456,28 @@ retry:
goto retry;
}
/* flags: BLK_MQ_REQ_PREEMPT and/or BLK_MQ_REQ_NOWAIT. */
static struct request *blk_old_get_request(struct request_queue *q,
unsigned int op, gfp_t gfp_mask)
unsigned int op, blk_mq_req_flags_t flags)
{
struct request *rq;
gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? GFP_ATOMIC :
__GFP_DIRECT_RECLAIM;
int ret = 0;
WARN_ON_ONCE(q->mq_ops);
/* create ioc upfront */
create_io_context(gfp_mask, q->node);
ret = blk_queue_enter(q, flags);
if (ret)
return ERR_PTR(ret);
spin_lock_irq(q->queue_lock);
rq = get_request(q, op, NULL, gfp_mask);
rq = get_request(q, op, NULL, flags);
if (IS_ERR(rq)) {
spin_unlock_irq(q->queue_lock);
blk_queue_exit(q);
return rq;
}
@ -1415,25 +1488,40 @@ static struct request *blk_old_get_request(struct request_queue *q,
return rq;
}
struct request *blk_get_request(struct request_queue *q, unsigned int op,
gfp_t gfp_mask)
/**
* blk_get_request_flags - allocate a request
* @q: request queue to allocate a request for
* @op: operation (REQ_OP_*) and REQ_* flags, e.g. REQ_SYNC.
* @flags: BLK_MQ_REQ_* flags, e.g. BLK_MQ_REQ_NOWAIT.
*/
struct request *blk_get_request_flags(struct request_queue *q, unsigned int op,
blk_mq_req_flags_t flags)
{
struct request *req;
WARN_ON_ONCE(op & REQ_NOWAIT);
WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT));
if (q->mq_ops) {
req = blk_mq_alloc_request(q, op,
(gfp_mask & __GFP_DIRECT_RECLAIM) ?
0 : BLK_MQ_REQ_NOWAIT);
req = blk_mq_alloc_request(q, op, flags);
if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
q->mq_ops->initialize_rq_fn(req);
} else {
req = blk_old_get_request(q, op, gfp_mask);
req = blk_old_get_request(q, op, flags);
if (!IS_ERR(req) && q->initialize_rq_fn)
q->initialize_rq_fn(req);
}
return req;
}
EXPORT_SYMBOL(blk_get_request_flags);
struct request *blk_get_request(struct request_queue *q, unsigned int op,
gfp_t gfp_mask)
{
return blk_get_request_flags(q, op, gfp_mask & __GFP_DIRECT_RECLAIM ?
0 : BLK_MQ_REQ_NOWAIT);
}
EXPORT_SYMBOL(blk_get_request);
/**
@ -1576,6 +1664,7 @@ void __blk_put_request(struct request_queue *q, struct request *req)
blk_free_request(rl, req);
freed_request(rl, sync, rq_flags);
blk_put_rl(rl);
blk_queue_exit(q);
}
}
EXPORT_SYMBOL_GPL(__blk_put_request);
@ -1857,8 +1946,10 @@ get_rq:
* Grab a free request. This might sleep but cannot fail.
* Returns with the queue unlocked.
*/
req = get_request(q, bio->bi_opf, bio, GFP_NOIO);
blk_queue_enter_live(q);
req = get_request(q, bio->bi_opf, bio, 0);
if (IS_ERR(req)) {
blk_queue_exit(q);
__wbt_done(q->rq_wb, wb_acct);
if (PTR_ERR(req) == -ENOMEM)
bio->bi_status = BLK_STS_RESOURCE;
@ -2200,8 +2291,10 @@ blk_qc_t generic_make_request(struct bio *bio)
current->bio_list = bio_list_on_stack;
do {
struct request_queue *q = bio->bi_disk->queue;
blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ?
BLK_MQ_REQ_NOWAIT : 0;
if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) {
if (likely(blk_queue_enter(q, flags) == 0)) {
struct bio_list lower, same;
/* Create a fresh bio_list for all subordinate requests */
@ -2241,6 +2334,40 @@ out:
}
EXPORT_SYMBOL(generic_make_request);
/**
* direct_make_request - hand a buffer directly to its device driver for I/O
* @bio: The bio describing the location in memory and on the device.
*
* This function behaves like generic_make_request(), but does not protect
* against recursion. Must only be used if the called driver is known
* to not call generic_make_request (or direct_make_request) again from
* its make_request function. (Calling direct_make_request again from
* a workqueue is perfectly fine as that doesn't recurse).
*/
blk_qc_t direct_make_request(struct bio *bio)
{
struct request_queue *q = bio->bi_disk->queue;
bool nowait = bio->bi_opf & REQ_NOWAIT;
blk_qc_t ret;
if (!generic_make_request_checks(bio))
return BLK_QC_T_NONE;
if (unlikely(blk_queue_enter(q, nowait ? BLK_MQ_REQ_NOWAIT : 0))) {
if (nowait && !blk_queue_dying(q))
bio->bi_status = BLK_STS_AGAIN;
else
bio->bi_status = BLK_STS_IOERR;
bio_endio(bio);
return BLK_QC_T_NONE;
}
ret = q->make_request_fn(q, bio);
blk_queue_exit(q);
return ret;
}
EXPORT_SYMBOL_GPL(direct_make_request);
/**
* submit_bio - submit a bio to the block device layer for I/O
* @bio: The &struct bio which describes the I/O
@ -2285,6 +2412,17 @@ blk_qc_t submit_bio(struct bio *bio)
}
EXPORT_SYMBOL(submit_bio);
bool blk_poll(struct request_queue *q, blk_qc_t cookie)
{
if (!q->poll_fn || !blk_qc_t_valid(cookie))
return false;
if (current->plug)
blk_flush_plug_list(current->plug, false);
return q->poll_fn(q, cookie);
}
EXPORT_SYMBOL_GPL(blk_poll);
/**
* blk_cloned_rq_check_limits - Helper function to check a cloned request
* for the new queue limits
@ -2350,7 +2488,7 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
* bypass a potential scheduler on the bottom device for
* insert.
*/
blk_mq_request_bypass_insert(rq);
blk_mq_request_bypass_insert(rq, true);
return BLK_STS_OK;
}
@ -2464,20 +2602,22 @@ void blk_account_io_done(struct request *req)
* Don't process normal requests when queue is suspended
* or in the process of suspending/resuming
*/
static struct request *blk_pm_peek_request(struct request_queue *q,
struct request *rq)
static bool blk_pm_allow_request(struct request *rq)
{
if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
(q->rpm_status != RPM_ACTIVE && !(rq->rq_flags & RQF_PM))))
return NULL;
else
return rq;
switch (rq->q->rpm_status) {
case RPM_RESUMING:
case RPM_SUSPENDING:
return rq->rq_flags & RQF_PM;
case RPM_SUSPENDED:
return false;
}
return true;
}
#else
static inline struct request *blk_pm_peek_request(struct request_queue *q,
struct request *rq)
static bool blk_pm_allow_request(struct request *rq)
{
return rq;
return true;
}
#endif
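The runtime-PM gate above can be read as a small decision table. The following is a minimal userspace sketch of that table, not the kernel code: `enum sketch_rpm`, `SKETCH_RQF_PM`, and `struct sketch_rq` are hypothetical stand-ins for the queue's `rpm_status`, the `RQF_PM` flag, and `struct request`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's rpm_status values and RQF_PM. */
enum sketch_rpm { SK_RPM_ACTIVE, SK_RPM_RESUMING, SK_RPM_SUSPENDED, SK_RPM_SUSPENDING };
#define SKETCH_RQF_PM (1u << 0)

struct sketch_rq {
	enum sketch_rpm queue_state;	/* state of the owning queue */
	unsigned int rq_flags;		/* per-request flags */
};

/*
 * Mirrors the shape of the new helper: while the queue is in transition only
 * PM requests pass, a suspended queue passes nothing, anything else passes.
 */
static bool sketch_pm_allow_request(const struct sketch_rq *rq)
{
	switch (rq->queue_state) {
	case SK_RPM_RESUMING:
	case SK_RPM_SUSPENDING:
		return (rq->rq_flags & SKETCH_RQF_PM) != 0;
	case SK_RPM_SUSPENDED:
		return false;
	default:
		return true;
	}
}
```

Returning a bool per request, rather than a request pointer or NULL, is what lets a caller keep scanning the queue for a later request that is allowed.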
@ -2517,6 +2657,48 @@ void blk_account_io_start(struct request *rq, bool new_io)
part_stat_unlock();
}
static struct request *elv_next_request(struct request_queue *q)
{
struct request *rq;
struct blk_flush_queue *fq = blk_get_flush_queue(q, NULL);
WARN_ON_ONCE(q->mq_ops);
while (1) {
list_for_each_entry(rq, &q->queue_head, queuelist) {
if (blk_pm_allow_request(rq))
return rq;
if (rq->rq_flags & RQF_SOFTBARRIER)
break;
}
/*
* Flush request is running and flush request isn't queueable
* in the drive, we can hold the queue till flush request is
* finished. Even if we don't do this, the driver can't dispatch
* next requests and will requeue them. And this can improve
* throughput too. For example, we have request flush1, write1,
* flush 2. flush1 is dispatched, then queue is held, write1
* isn't inserted to queue. After flush1 is finished, flush2
* will be dispatched. Since disk cache is already clean,
* flush2 will be finished very soon, so looks like flush2 is
* folded to flush1.
* Since the queue is held, a flag is set to indicate the queue
* should be restarted later. Please see flush_end_io() for
* details.
*/
if (fq->flush_pending_idx != fq->flush_running_idx &&
!queue_flush_queueable(q)) {
fq->flush_queue_delayed = 1;
return NULL;
}
if (unlikely(blk_queue_bypass(q)) ||
!q->elevator->type->ops.sq.elevator_dispatch_fn(q, 0))
return NULL;
}
}
/**
* blk_peek_request - peek at the top of a request queue
* @q: request queue to peek at
@ -2538,12 +2720,7 @@ struct request *blk_peek_request(struct request_queue *q)
lockdep_assert_held(q->queue_lock);
WARN_ON_ONCE(q->mq_ops);
while ((rq = __elv_next_request(q)) != NULL) {
rq = blk_pm_peek_request(q, rq);
if (!rq)
break;
while ((rq = elv_next_request(q)) != NULL) {
if (!(rq->rq_flags & RQF_STARTED)) {
/*
* This is the first time the device driver
@ -2695,6 +2872,27 @@ struct request *blk_fetch_request(struct request_queue *q)
}
EXPORT_SYMBOL(blk_fetch_request);
/*
* Steal bios from a request and add them to a bio list.
* The request must not have been partially completed before.
*/
void blk_steal_bios(struct bio_list *list, struct request *rq)
{
if (rq->bio) {
if (list->tail)
list->tail->bi_next = rq->bio;
else
list->head = rq->bio;
list->tail = rq->biotail;
rq->bio = NULL;
rq->biotail = NULL;
}
rq->__data_len = 0;
}
EXPORT_SYMBOL_GPL(blk_steal_bios);
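The splice performed by blk_steal_bios() is an O(1) append of one singly linked chain onto a head/tail list. A minimal userspace sketch of the same pointer moves follows; `struct node`, `struct node_list`, and `struct owner` are hypothetical stand-ins for `struct bio`, `struct bio_list`, and `struct request`, and nothing here is kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct bio, struct bio_list and struct request. */
struct node { int id; struct node *next; };
struct node_list { struct node *head, *tail; };
struct owner { struct node *first, *last; unsigned int data_len; };

/* Move the owner's whole chain to the end of the list without walking it. */
static void steal_nodes(struct node_list *list, struct owner *o)
{
	if (o->first) {
		if (list->tail)
			list->tail->next = o->first;	/* append to a non-empty list */
		else
			list->head = o->first;		/* list was empty */
		list->tail = o->last;
		o->first = NULL;
		o->last = NULL;
	}
	o->data_len = 0;	/* the owner no longer carries any payload */
}
```

Because only the chain's first and last pointers are touched, the cost does not depend on how many elements are moved.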
/**
* blk_update_request - Special helper function for request stacking drivers
* @req: the request being processed

@ -231,8 +231,13 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
/* release the tag's ownership to the req cloned from */
spin_lock_irqsave(&fq->mq_flush_lock, flags);
hctx = blk_mq_map_queue(q, flush_rq->mq_ctx->cpu);
blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
flush_rq->tag = -1;
if (!q->elevator) {
blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
flush_rq->tag = -1;
} else {
blk_mq_put_driver_tag_hctx(hctx, flush_rq);
flush_rq->internal_tag = -1;
}
}
running = &fq->flush_queue[fq->flush_running_idx];
@ -318,19 +323,26 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq)
blk_rq_init(q, flush_rq);
/*
* Borrow tag from the first request since they can't
* be in flight at the same time. And acquire the tag's
* ownership for flush req.
* In case of none scheduler, borrow tag from the first request
* since they can't be in flight at the same time. And acquire
* the tag's ownership for flush req.
*
* In case of IO scheduler, flush rq need to borrow scheduler tag
* just for cheating put/get driver tag.
*/
if (q->mq_ops) {
struct blk_mq_hw_ctx *hctx;
flush_rq->mq_ctx = first_rq->mq_ctx;
flush_rq->tag = first_rq->tag;
fq->orig_rq = first_rq;
hctx = blk_mq_map_queue(q, first_rq->mq_ctx->cpu);
blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq);
if (!q->elevator) {
fq->orig_rq = first_rq;
flush_rq->tag = first_rq->tag;
hctx = blk_mq_map_queue(q, first_rq->mq_ctx->cpu);
blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq);
} else {
flush_rq->internal_tag = first_rq->internal_tag;
}
}
flush_rq->cmd_flags = REQ_OP_FLUSH | REQ_PREFLUSH;
@ -394,6 +406,11 @@ static void mq_flush_data_end_io(struct request *rq, blk_status_t error)
hctx = blk_mq_map_queue(q, ctx->cpu);
if (q->elevator) {
WARN_ON(rq->tag < 0);
blk_mq_put_driver_tag_hctx(hctx, rq);
}
/*
* After populating an empty queue, kick it to avoid stall. Read
* the comment in flush_end_io().
@ -463,7 +480,7 @@ void blk_insert_flush(struct request *rq)
if ((policy & REQ_FSEQ_DATA) &&
!(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
if (q->mq_ops)
blk_mq_sched_insert_request(rq, false, true, false, false);
blk_mq_request_bypass_insert(rq, false);
else
list_add_tail(&rq->queuelist, &q->queue_head);
return;

@ -275,51 +275,18 @@ static unsigned int __blkdev_sectors_to_bio_pages(sector_t nr_sects)
return min(pages, (sector_t)BIO_MAX_PAGES);
}
/**
* __blkdev_issue_zeroout - generate number of zero filed write bios
* @bdev: blockdev to issue
* @sector: start sector
* @nr_sects: number of sectors to write
* @gfp_mask: memory allocation flags (for bio_alloc)
* @biop: pointer to anchor bio
* @flags: controls detailed behavior
*
* Description:
* Zero-fill a block range, either using hardware offload or by explicitly
* writing zeroes to the device.
*
* Note that this function may fail with -EOPNOTSUPP if the driver signals
* zeroing offload support, but the device fails to process the command (for
* some devices there is no non-destructive way to verify whether this
* operation is actually supported). In this case the caller should call
* retry the call to blkdev_issue_zeroout() and the fallback path will be used.
*
* If a device is using logical block provisioning, the underlying space will
* not be released if %flags contains BLKDEV_ZERO_NOUNMAP.
*
* If %flags contains BLKDEV_ZERO_NOFALLBACK, the function will return
* -EOPNOTSUPP if no explicit hardware offload for zeroing is provided.
*/
int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
unsigned flags)
static int __blkdev_issue_zero_pages(struct block_device *bdev,
sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
struct bio **biop)
{
int ret;
int bi_size = 0;
struct request_queue *q = bdev_get_queue(bdev);
struct bio *bio = *biop;
int bi_size = 0;
unsigned int sz;
sector_t bs_mask;
bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
if ((sector | nr_sects) & bs_mask)
return -EINVAL;
if (!q)
return -ENXIO;
ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask,
biop, flags);
if (ret != -EOPNOTSUPP || (flags & BLKDEV_ZERO_NOFALLBACK))
goto out;
ret = 0;
while (nr_sects != 0) {
bio = next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects),
gfp_mask);
@ -339,8 +306,46 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
}
*biop = bio;
out:
return ret;
return 0;
}
/**
* __blkdev_issue_zeroout - generate number of zero filed write bios
* @bdev: blockdev to issue
* @sector: start sector
* @nr_sects: number of sectors to write
* @gfp_mask: memory allocation flags (for bio_alloc)
* @biop: pointer to anchor bio
* @flags: controls detailed behavior
*
* Description:
* Zero-fill a block range, either using hardware offload or by explicitly
* writing zeroes to the device.
*
* If a device is using logical block provisioning, the underlying space will
* not be released if %flags contains BLKDEV_ZERO_NOUNMAP.
*
* If %flags contains BLKDEV_ZERO_NOFALLBACK, the function will return
* -EOPNOTSUPP if no explicit hardware offload for zeroing is provided.
*/
int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
unsigned flags)
{
int ret;
sector_t bs_mask;
bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
if ((sector | nr_sects) & bs_mask)
return -EINVAL;
ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask,
biop, flags);
if (ret != -EOPNOTSUPP || (flags & BLKDEV_ZERO_NOFALLBACK))
return ret;
return __blkdev_issue_zero_pages(bdev, sector, nr_sects, gfp_mask,
biop);
}
EXPORT_SYMBOL(__blkdev_issue_zeroout);
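The bs_mask test used by the zeroout helpers checks that both the start sector and the length are multiples of the logical block size, expressed in 512-byte sectors. A minimal userspace sketch of that arithmetic follows; the function name is hypothetical, and it assumes the logical block size is a power of two of at least 512 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true when both sector and nr_sects are multiples of
 * the logical block size. Assumes logical_block_size is a power of two and
 * at least 512, so (size >> 9) - 1 is a valid low-bit mask.
 */
static bool range_is_block_aligned(uint64_t sector, uint64_t nr_sects,
				   unsigned int logical_block_size)
{
	uint64_t bs_mask = (logical_block_size >> 9) - 1;

	/* OR-ing the two values lets one mask test cover both of them. */
	return ((sector | nr_sects) & bs_mask) == 0;
}
```

With a 4096-byte logical block the mask is 7, so a range starting at sector 8 with 16 sectors passes, while a length of 4 sectors is rejected.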
@ -360,18 +365,49 @@ EXPORT_SYMBOL(__blkdev_issue_zeroout);
int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, unsigned flags)
{
int ret;
struct bio *bio = NULL;
int ret = 0;
sector_t bs_mask;
struct bio *bio;
struct blk_plug plug;
bool try_write_zeroes = !!bdev_write_zeroes_sectors(bdev);
bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
if ((sector | nr_sects) & bs_mask)
return -EINVAL;
retry:
bio = NULL;
blk_start_plug(&plug);
ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
&bio, flags);
if (try_write_zeroes) {
ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects,
gfp_mask, &bio, flags);
} else if (!(flags & BLKDEV_ZERO_NOFALLBACK)) {
ret = __blkdev_issue_zero_pages(bdev, sector, nr_sects,
gfp_mask, &bio);
} else {
/* No zeroing offload support */
ret = -EOPNOTSUPP;
}
if (ret == 0 && bio) {
ret = submit_bio_wait(bio);
bio_put(bio);
}
blk_finish_plug(&plug);
if (ret && try_write_zeroes) {
if (!(flags & BLKDEV_ZERO_NOFALLBACK)) {
try_write_zeroes = false;
goto retry;
}
if (!bdev_write_zeroes_sectors(bdev)) {
/*
* Zeroing offload support was indicated, but the
* device reported ILLEGAL REQUEST (for some devices
* there is no non-destructive way to verify whether
* WRITE ZEROES is actually supported).
*/
ret = -EOPNOTSUPP;
}
}
return ret;
}

@ -54,7 +54,6 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(NOMERGES),
QUEUE_FLAG_NAME(SAME_COMP),
QUEUE_FLAG_NAME(FAIL_IO),
QUEUE_FLAG_NAME(STACKABLE),
QUEUE_FLAG_NAME(NONROT),
QUEUE_FLAG_NAME(IO_STAT),
QUEUE_FLAG_NAME(DISCARD),
@ -75,6 +74,7 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(REGISTERED),
QUEUE_FLAG_NAME(SCSI_PASSTHROUGH),
QUEUE_FLAG_NAME(QUIESCED),
QUEUE_FLAG_NAME(PREEMPT_ONLY),
};
#undef QUEUE_FLAG_NAME
@ -180,7 +180,6 @@ static const char *const hctx_state_name[] = {
HCTX_STATE_NAME(STOPPED),
HCTX_STATE_NAME(TAG_ACTIVE),
HCTX_STATE_NAME(SCHED_RESTART),
HCTX_STATE_NAME(TAG_WAITING),
HCTX_STATE_NAME(START_ON_RUN),
};
#undef HCTX_STATE_NAME

@ -81,20 +81,103 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
} else
clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
if (blk_mq_hctx_has_pending(hctx)) {
blk_mq_run_hw_queue(hctx, true);
return true;
}
return false;
return blk_mq_run_hw_queue(hctx, true);
}
/*
* Only SCSI implements .get_budget and .put_budget, and SCSI restarts
* its queue by itself in its completion handler, so we don't need to
* restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
*/
static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
{
struct request_queue *q = hctx->queue;
struct elevator_queue *e = q->elevator;
LIST_HEAD(rq_list);
do {
struct request *rq;
if (e->type->ops.mq.has_work &&
!e->type->ops.mq.has_work(hctx))
break;
if (!blk_mq_get_dispatch_budget(hctx))
break;
rq = e->type->ops.mq.dispatch_request(hctx);
if (!rq) {
blk_mq_put_dispatch_budget(hctx);
break;
}
/*
* Now this rq owns the budget which has to be released
* if this rq won't be queued to driver via .queue_rq()
* in blk_mq_dispatch_rq_list().
*/
list_add(&rq->queuelist, &rq_list);
} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
}
static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
struct blk_mq_ctx *ctx)
{
unsigned idx = ctx->index_hw;
if (++idx == hctx->nr_ctx)
idx = 0;
return hctx->ctxs[idx];
}
/*
* Only SCSI implements .get_budget and .put_budget, and SCSI restarts
* its queue by itself in its completion handler, so we don't need to
* restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
*/
static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
{
struct request_queue *q = hctx->queue;
LIST_HEAD(rq_list);
struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
do {
struct request *rq;
if (!sbitmap_any_bit_set(&hctx->ctx_map))
break;
if (!blk_mq_get_dispatch_budget(hctx))
break;
rq = blk_mq_dequeue_from_ctx(hctx, ctx);
if (!rq) {
blk_mq_put_dispatch_budget(hctx);
break;
}
/*
* Now this rq owns the budget which has to be released
* if this rq won't be queued to driver via .queue_rq()
* in blk_mq_dispatch_rq_list().
*/
list_add(&rq->queuelist, &rq_list);
/* round robin for fair dispatch */
ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
WRITE_ONCE(hctx->dispatch_from, ctx);
}
/* return true if hw queue need to be run again */
void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
{
struct request_queue *q = hctx->queue;
struct elevator_queue *e = q->elevator;
const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
bool did_work = false;
LIST_HEAD(rq_list);
/* RCU or SRCU read lock is needed before checking quiesced flag */
@ -122,29 +205,34 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
* scheduler, we can no longer merge or sort them. So it's best to
* leave them there for as long as we can. Mark the hw queue as
* needing a restart in that case.
*
* We want to dispatch from the scheduler if there was nothing
* on the dispatch list or we were able to dispatch from the
* dispatch list.
*/
if (!list_empty(&rq_list)) {
blk_mq_sched_mark_restart_hctx(hctx);
did_work = blk_mq_dispatch_rq_list(q, &rq_list);
} else if (!has_sched_dispatch) {
if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
if (has_sched_dispatch)
blk_mq_do_dispatch_sched(hctx);
else
blk_mq_do_dispatch_ctx(hctx);
}
} else if (has_sched_dispatch) {
blk_mq_do_dispatch_sched(hctx);
} else if (q->mq_ops->get_budget) {
/*
* If we need to get budget before queuing request, we
* dequeue request one by one from sw queue for avoiding
* to mess up I/O merge when dispatch runs out of resource.
*
* TODO: get more budgets, and dequeue more requests in
* one time.
*/
blk_mq_do_dispatch_ctx(hctx);
} else {
blk_mq_flush_busy_ctxs(hctx, &rq_list);
blk_mq_dispatch_rq_list(q, &rq_list);
}
/*
* We want to dispatch from the scheduler if we had no work left
* on the dispatch list, OR if we did have work but weren't able
* to make progress.
*/
if (!did_work && has_sched_dispatch) {
do {
struct request *rq;
rq = e->type->ops.mq.dispatch_request(hctx);
if (!rq)
break;
list_add(&rq->queuelist, &rq_list);
} while (blk_mq_dispatch_rq_list(q, &rq_list));
blk_mq_dispatch_rq_list(q, &rq_list, false);
}
}
@ -260,21 +348,21 @@ void blk_mq_sched_request_inserted(struct request *rq)
EXPORT_SYMBOL_GPL(blk_mq_sched_request_inserted);
static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
bool has_sched,
struct request *rq)
{
if (rq->tag == -1) {
rq->rq_flags |= RQF_SORTED;
return false;
/* dispatch flush rq directly */
if (rq->rq_flags & RQF_FLUSH_SEQ) {
spin_lock(&hctx->lock);
list_add(&rq->queuelist, &hctx->dispatch);
spin_unlock(&hctx->lock);
return true;
}
/*
* If we already have a real request tag, send directly to
* the dispatch list.
*/
spin_lock(&hctx->lock);
list_add(&rq->queuelist, &hctx->dispatch);
spin_unlock(&hctx->lock);
return true;
if (has_sched)
rq->rq_flags |= RQF_SORTED;
return false;
}
/**
@ -339,21 +427,6 @@ done:
}
}
/*
* Add flush/fua to the queue. If we fail getting a driver tag, then
* punt to the requeue list. Requeue will re-invoke us from a context
* that's safe to block from.
*/
static void blk_mq_sched_insert_flush(struct blk_mq_hw_ctx *hctx,
struct request *rq, bool can_block)
{
if (blk_mq_get_driver_tag(rq, &hctx, can_block)) {
blk_insert_flush(rq);
blk_mq_run_hw_queue(hctx, true);
} else
blk_mq_add_to_requeue_list(rq, false, true);
}
void blk_mq_sched_insert_request(struct request *rq, bool at_head,
bool run_queue, bool async, bool can_block)
{
@ -362,12 +435,15 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
struct blk_mq_ctx *ctx = rq->mq_ctx;
struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
if (rq->tag == -1 && op_is_flush(rq->cmd_flags)) {
blk_mq_sched_insert_flush(hctx, rq, can_block);
return;
/* flush rq in flush machinery need to be dispatched directly */
if (!(rq->rq_flags & RQF_FLUSH_SEQ) && op_is_flush(rq->cmd_flags)) {
blk_insert_flush(rq);
goto run;
}
if (e && blk_mq_sched_bypass_insert(hctx, rq))
WARN_ON(e && (rq->tag != -1));
if (blk_mq_sched_bypass_insert(hctx, !!e, rq))
goto run;
if (e && e->type->ops.mq.insert_requests) {
@ -393,23 +469,6 @@ void blk_mq_sched_insert_requests(struct request_queue *q,
struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
struct elevator_queue *e = hctx->queue->elevator;
if (e) {
struct request *rq, *next;
/*
* We bypass requests that already have a driver tag assigned,
* which should only be flushes. Flushes are only ever inserted
* as single requests, so we shouldn't ever hit the
* WARN_ON_ONCE() below (but let's handle it just in case).
*/
list_for_each_entry_safe(rq, next, list, queuelist) {
if (WARN_ON_ONCE(rq->tag != -1)) {
list_del_init(&rq->queuelist);
blk_mq_sched_bypass_insert(hctx, rq);
}
}
}
if (e && e->type->ops.mq.insert_requests)
e->type->ops.mq.insert_requests(hctx, list, false);
else

@ -298,12 +298,12 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
}
EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
int blk_mq_reinit_tagset(struct blk_mq_tag_set *set,
int (reinit_request)(void *, struct request *))
int blk_mq_tagset_iter(struct blk_mq_tag_set *set, void *data,
int (fn)(void *, struct request *))
{
int i, j, ret = 0;
if (WARN_ON_ONCE(!reinit_request))
if (WARN_ON_ONCE(!fn))
goto out;
for (i = 0; i < set->nr_hw_queues; i++) {
@ -316,8 +316,7 @@ int blk_mq_reinit_tagset(struct blk_mq_tag_set *set,
if (!tags->static_rqs[j])
continue;
ret = reinit_request(set->driver_data,
tags->static_rqs[j]);
ret = fn(data, tags->static_rqs[j]);
if (ret)
goto out;
}
@ -326,7 +325,7 @@ int blk_mq_reinit_tagset(struct blk_mq_tag_set *set,
out:
return ret;
}
EXPORT_SYMBOL_GPL(blk_mq_reinit_tagset);
EXPORT_SYMBOL_GPL(blk_mq_tagset_iter);
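The renamed iterator takes an opaque data pointer and a callback, skips empty slots, and stops at the first non-zero return. A minimal userspace sketch of that control flow follows; `iterate_slots` and `sum_until_negative` are hypothetical names, and plain `int *` slots stand in for the statically allocated requests.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*slot_fn)(void *data, int *item);

/* Visit every non-NULL slot; stop and return the first non-zero result. */
static int iterate_slots(int **slots, size_t nr, void *data, slot_fn fn)
{
	int ret = 0;
	size_t i;

	if (!fn)
		return 0;	/* nothing to call, report success */

	for (i = 0; i < nr; i++) {
		if (!slots[i])
			continue;
		ret = fn(data, slots[i]);
		if (ret)
			break;
	}
	return ret;
}

/* Sample callback: accumulate into *data, fail on a negative item. */
static int sum_until_negative(void *data, int *item)
{
	if (*item < 0)
		return -1;
	*(int *)data += *item;
	return 0;
}
```

Passing the data pointer explicitly, instead of always using a field of the set, is what makes the iterator reusable for callbacks other than reinitialisation.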
void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
void *priv)

@ -44,14 +44,9 @@ static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
return sbq_wait_ptr(bt, &hctx->wait_index);
}
enum {
BLK_MQ_TAG_CACHE_MIN = 1,
BLK_MQ_TAG_CACHE_MAX = 64,
};
enum {
BLK_MQ_TAG_FAIL = -1U,
BLK_MQ_TAG_MIN = BLK_MQ_TAG_CACHE_MIN,
BLK_MQ_TAG_MIN = 1,
BLK_MQ_TAG_MAX = BLK_MQ_TAG_FAIL - 1,
};

@ -37,6 +37,7 @@
#include "blk-wbt.h"
#include "blk-mq-sched.h"
static bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie);
static void blk_mq_poll_stats_start(struct request_queue *q);
static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
@ -60,10 +61,10 @@ static int blk_mq_poll_stats_bkt(const struct request *rq)
/*
* Check if any of the ctx's have pending work in this hardware queue
*/
bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
{
return sbitmap_any_bit_set(&hctx->ctx_map) ||
!list_empty_careful(&hctx->dispatch) ||
return !list_empty_careful(&hctx->dispatch) ||
sbitmap_any_bit_set(&hctx->ctx_map) ||
blk_mq_sched_has_work(hctx);
}
@ -125,7 +126,8 @@ void blk_freeze_queue_start(struct request_queue *q)
freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
if (freeze_depth == 1) {
percpu_ref_kill(&q->q_usage_counter);
blk_mq_run_hw_queues(q, false);
if (q->mq_ops)
blk_mq_run_hw_queues(q, false);
}
}
EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
@ -255,13 +257,6 @@ void blk_mq_wake_waiters(struct request_queue *q)
queue_for_each_hw_ctx(q, hctx, i)
if (blk_mq_hw_queue_mapped(hctx))
blk_mq_tag_wakeup_all(hctx->tags, true);
/*
* If we are called because the queue has now been marked as
* dying, we need to ensure that processes currently waiting on
* the queue are notified as well.
*/
wake_up_all(&q->mq_freeze_wq);
}
bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
@ -296,6 +291,8 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
rq->q = data->q;
rq->mq_ctx = data->ctx;
rq->cmd_flags = op;
if (data->flags & BLK_MQ_REQ_PREEMPT)
rq->rq_flags |= RQF_PREEMPT;
if (blk_queue_io_stat(data->q))
rq->rq_flags |= RQF_IO_STAT;
/* do not touch atomic flags, it needs atomic ops against the timer */
@ -336,12 +333,14 @@ static struct request *blk_mq_get_request(struct request_queue *q,
struct elevator_queue *e = q->elevator;
struct request *rq;
unsigned int tag;
struct blk_mq_ctx *local_ctx = NULL;
bool put_ctx_on_error = false;
blk_queue_enter_live(q);
data->q = q;
if (likely(!data->ctx))
data->ctx = local_ctx = blk_mq_get_ctx(q);
if (likely(!data->ctx)) {
data->ctx = blk_mq_get_ctx(q);
put_ctx_on_error = true;
}
if (likely(!data->hctx))
data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
if (op & REQ_NOWAIT)
@ -360,8 +359,8 @@ static struct request *blk_mq_get_request(struct request_queue *q,
tag = blk_mq_get_tag(data);
if (tag == BLK_MQ_TAG_FAIL) {
if (local_ctx) {
blk_mq_put_ctx(local_ctx);
if (put_ctx_on_error) {
blk_mq_put_ctx(data->ctx);
data->ctx = NULL;
}
blk_queue_exit(q);
@ -384,13 +383,13 @@ static struct request *blk_mq_get_request(struct request_queue *q,
}
struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
unsigned int flags)
blk_mq_req_flags_t flags)
{
struct blk_mq_alloc_data alloc_data = { .flags = flags };
struct request *rq;
int ret;
ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
ret = blk_queue_enter(q, flags);
if (ret)
return ERR_PTR(ret);
@ -410,7 +409,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
EXPORT_SYMBOL(blk_mq_alloc_request);
struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
unsigned int op, unsigned int flags, unsigned int hctx_idx)
unsigned int op, blk_mq_req_flags_t flags, unsigned int hctx_idx)
{
struct blk_mq_alloc_data alloc_data = { .flags = flags };
struct request *rq;
@ -429,7 +428,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
if (hctx_idx >= q->nr_hw_queues)
return ERR_PTR(-EIO);
ret = blk_queue_enter(q, true);
ret = blk_queue_enter(q, flags);
if (ret)
return ERR_PTR(ret);
@ -476,8 +475,14 @@ void blk_mq_free_request(struct request *rq)
if (rq->rq_flags & RQF_MQ_INFLIGHT)
atomic_dec(&hctx->nr_active);
if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
laptop_io_completion(q->backing_dev_info);
wbt_done(q->rq_wb, &rq->issue_stat);
if (blk_rq_rl(rq))
blk_put_rl(blk_rq_rl(rq));
clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
clear_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
if (rq->tag != -1)
@ -593,22 +598,32 @@ void blk_mq_start_request(struct request *rq)
blk_add_timer(rq);
/*
* Ensure that ->deadline is visible before set the started
* flag and clear the completed flag.
*/
smp_mb__before_atomic();
WARN_ON_ONCE(test_bit(REQ_ATOM_STARTED, &rq->atomic_flags));
/*
* Mark us as started and clear complete. Complete might have been
* set if requeue raced with timeout, which then marked it as
* complete. So be sure to clear complete again when we start
* the request, otherwise we'll ignore the completion event.
*
* Ensure that ->deadline is visible before we set STARTED, such that
* blk_mq_check_expired() is guaranteed to observe our ->deadline when
* it observes STARTED.
*/
if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
smp_wmb();
set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags)) {
/*
* Coherence order guarantees these consecutive stores to a
* single variable propagate in the specified order. Thus the
* clear_bit() is ordered _after_ the set bit. See
* blk_mq_check_expired().
*
* (the bits must be part of the same byte for this to be
* true).
*/
clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
}
if (q->dma_drain_size && blk_rq_bytes(rq)) {
/*
@ -634,6 +649,8 @@ static void __blk_mq_requeue_request(struct request *rq)
{
struct request_queue *q = rq->q;
blk_mq_put_driver_tag(rq);
trace_block_rq_requeue(q, rq);
wbt_requeue(q->rq_wb, &rq->issue_stat);
blk_mq_sched_requeue_request(rq);
@ -690,7 +707,7 @@ void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
/*
* We abuse this flag that is otherwise used by the I/O scheduler to
* request head insertation from the workqueue.
* request head insertion from the workqueue.
*/
BUG_ON(rq->rq_flags & RQF_SOFTBARRIER);
@ -778,10 +795,19 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
struct request *rq, void *priv, bool reserved)
{
struct blk_mq_timeout_data *data = priv;
unsigned long deadline;
if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
return;
/*
* Ensures that if we see STARTED we must also see our
* up-to-date deadline, see blk_mq_start_request().
*/
smp_rmb();
deadline = READ_ONCE(rq->deadline);
/*
* The rq being checked may have been freed and reallocated
* out already here, we avoid this race by checking rq->deadline
@ -795,11 +821,20 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
* and clearing the flag in blk_mq_start_request(), so
* this rq won't be timed out too.
*/
if (time_after_eq(jiffies, rq->deadline)) {
if (!blk_mark_rq_complete(rq))
if (time_after_eq(jiffies, deadline)) {
if (!blk_mark_rq_complete(rq)) {
/*
* Again coherence order ensures that consecutive reads
* from the same variable must be in that order. This
* ensures that if we see COMPLETE clear, we must then
* see STARTED set and we'll ignore this timeout.
*
* (There's also the MB implied by the test_and_clear())
*/
blk_mq_rq_timed_out(rq, reserved);
} else if (!data->next_set || time_after(data->next, rq->deadline)) {
data->next = rq->deadline;
}
} else if (!data->next_set || time_after(data->next, deadline)) {
data->next = deadline;
data->next_set = 1;
}
}
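The expiry check above compares a snapshot of the deadline against jiffies with time_after_eq(), which stays correct when the tick counter wraps. A minimal userspace sketch of that comparison follows; `sketch_time_after_eq` is a hypothetical name, and it assumes the usual two's-complement conversion from unsigned long to long.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * True when tick value a is at or after tick value b, as long as the two are
 * less than half the counter range apart. The unsigned subtraction wraps
 * modulo 2^N; reinterpreting the difference as signed gives its direction.
 * (Assumes two's-complement conversion, as on common platforms.)
 */
static bool sketch_time_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}
```

A plain `a >= b` would report the wrong order for a deadline set just before the counter wraps and checked just after it.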
@ -880,6 +915,45 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
}
EXPORT_SYMBOL_GPL(blk_mq_flush_busy_ctxs);
struct dispatch_rq_data {
struct blk_mq_hw_ctx *hctx;
struct request *rq;
};
static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr,
void *data)
{
struct dispatch_rq_data *dispatch_data = data;
struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
spin_lock(&ctx->lock);
if (unlikely(!list_empty(&ctx->rq_list))) {
dispatch_data->rq = list_entry_rq(ctx->rq_list.next);
list_del_init(&dispatch_data->rq->queuelist);
if (list_empty(&ctx->rq_list))
sbitmap_clear_bit(sb, bitnr);
}
spin_unlock(&ctx->lock);
return !dispatch_data->rq;
}
struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
struct blk_mq_ctx *start)
{
unsigned off = start ? start->index_hw : 0;
struct dispatch_rq_data data = {
.hctx = hctx,
.rq = NULL,
};
__sbitmap_for_each_set(&hctx->ctx_map, off,
dispatch_rq_from_ctx, &data);
return data.rq;
}
static inline unsigned int queued_to_index(unsigned int queued)
{
if (!queued)
@ -920,109 +994,95 @@ done:
return rq->tag != -1;
}
static void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
struct request *rq)
{
blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
rq->tag = -1;
if (rq->rq_flags & RQF_MQ_INFLIGHT) {
rq->rq_flags &= ~RQF_MQ_INFLIGHT;
atomic_dec(&hctx->nr_active);
}
}
static void blk_mq_put_driver_tag_hctx(struct blk_mq_hw_ctx *hctx,
struct request *rq)
{
if (rq->tag == -1 || rq->internal_tag == -1)
return;
__blk_mq_put_driver_tag(hctx, rq);
}
static void blk_mq_put_driver_tag(struct request *rq)
{
struct blk_mq_hw_ctx *hctx;
if (rq->tag == -1 || rq->internal_tag == -1)
return;
hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu);
__blk_mq_put_driver_tag(hctx, rq);
}
/*
* If we fail getting a driver tag because all the driver tags are already
* assigned and on the dispatch list, BUT the first entry does not have a
* tag, then we could deadlock. For that case, move entries with assigned
* driver tags to the front, leaving the set of tagged requests in the
* same order, and the untagged set in the same order.
*/
static bool reorder_tags_to_front(struct list_head *list)
{
struct request *rq, *tmp, *first = NULL;
list_for_each_entry_safe_reverse(rq, tmp, list, queuelist) {
if (rq == first)
break;
if (rq->tag != -1) {
list_move(&rq->queuelist, list);
if (!first)
first = rq;
}
}
return first != NULL;
}
static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode, int flags,
void *key)
static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
int flags, void *key)
{
struct blk_mq_hw_ctx *hctx;
hctx = container_of(wait, struct blk_mq_hw_ctx, dispatch_wait);
list_del(&wait->entry);
clear_bit_unlock(BLK_MQ_S_TAG_WAITING, &hctx->state);
list_del_init(&wait->entry);
blk_mq_run_hw_queue(hctx, true);
return 1;
}
static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
/*
* Mark us waiting for a tag. For shared tags, this involves hooking us into
* the tag wakeups. For non-shared tags, we can simply mark us needing a
* restart. For both cases, take care to check the condition again after
* marking us as waiting.
*/
static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
struct request *rq)
{
struct blk_mq_hw_ctx *this_hctx = *hctx;
bool shared_tags = (this_hctx->flags & BLK_MQ_F_TAG_SHARED) != 0;
struct sbq_wait_state *ws;
wait_queue_entry_t *wait;
bool ret;
if (!shared_tags) {
if (!test_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state))
set_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state);
} else {
wait = &this_hctx->dispatch_wait;
if (!list_empty_careful(&wait->entry))
return false;
spin_lock(&this_hctx->lock);
if (!list_empty(&wait->entry)) {
spin_unlock(&this_hctx->lock);
return false;
}
ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
add_wait_queue(&ws->wait, wait);
}
/*
* The TAG_WAITING bit serves as a lock protecting hctx->dispatch_wait.
* The thread which wins the race to grab this bit adds the hardware
* queue to the wait queue.
* It's possible that a tag was freed in the window between the
* allocation failure and adding the hardware queue to the wait
* queue.
*/
if (test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state) ||
test_and_set_bit_lock(BLK_MQ_S_TAG_WAITING, &hctx->state))
return false;
ret = blk_mq_get_driver_tag(rq, hctx, false);
init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake);
ws = bt_wait_ptr(&hctx->tags->bitmap_tags, hctx);
if (!shared_tags) {
/*
* Don't clear RESTART here, someone else could have set it.
* At most this will cost an extra queue run.
*/
return ret;
} else {
if (!ret) {
spin_unlock(&this_hctx->lock);
return false;
}
/*
* As soon as this returns, it's no longer safe to fiddle with
* hctx->dispatch_wait, since a completion can wake up the wait queue
* and unlock the bit.
*/
add_wait_queue(&ws->wait, &hctx->dispatch_wait);
return true;
/*
* We got a tag, remove ourselves from the wait queue to ensure
* someone else gets the wakeup.
*/
spin_lock_irq(&ws->wait.lock);
list_del_init(&wait->entry);
spin_unlock_irq(&ws->wait.lock);
spin_unlock(&this_hctx->lock);
return true;
}
}
bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
bool got_budget)
{
struct blk_mq_hw_ctx *hctx;
struct request *rq;
struct request *rq, *nxt;
bool no_tag = false;
int errors, queued;
if (list_empty(list))
return false;
WARN_ON(!list_is_singular(list) && got_budget);
/*
* Now process all the entries, sending them to the driver.
*/
@ -1033,23 +1093,29 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
rq = list_first_entry(list, struct request, queuelist);
if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
if (!queued && reorder_tags_to_front(list))
continue;
/*
* The initial allocation attempt failed, so we need to
* rerun the hardware queue when a tag is freed.
* rerun the hardware queue when a tag is freed. The
* waitqueue takes care of that. If the queue is run
* before we add this entry back on the dispatch list,
* we'll re-run it below.
*/
if (!blk_mq_dispatch_wait_add(hctx))
if (!blk_mq_mark_tag_wait(&hctx, rq)) {
if (got_budget)
blk_mq_put_dispatch_budget(hctx);
/*
* For non-shared tags, the RESTART check
* will suffice.
*/
if (hctx->flags & BLK_MQ_F_TAG_SHARED)
no_tag = true;
break;
}
}
/*
* It's possible that a tag was freed in the window
* between the allocation failure and adding the
* hardware queue to the wait queue.
*/
if (!blk_mq_get_driver_tag(rq, &hctx, false))
break;
if (!got_budget && !blk_mq_get_dispatch_budget(hctx)) {
blk_mq_put_driver_tag(rq);
break;
}
list_del_init(&rq->queuelist);
@ -1063,15 +1129,21 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
if (list_empty(list))
bd.last = true;
else {
struct request *nxt;
nxt = list_first_entry(list, struct request, queuelist);
bd.last = !blk_mq_get_driver_tag(nxt, NULL, false);
}
ret = q->mq_ops->queue_rq(hctx, &bd);
if (ret == BLK_STS_RESOURCE) {
blk_mq_put_driver_tag_hctx(hctx, rq);
/*
* If an I/O scheduler has been configured and we got a
* driver tag for the next request already, free it
* again.
*/
if (!list_empty(list)) {
nxt = list_first_entry(list, struct request, queuelist);
blk_mq_put_driver_tag(nxt);
}
list_add(&rq->queuelist, list);
__blk_mq_requeue_request(rq);
break;
@ -1093,13 +1165,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
* that is where we will continue on next queue run.
*/
if (!list_empty(list)) {
/*
* If an I/O scheduler has been configured and we got a driver
* tag for the next request already, free it again.
*/
rq = list_first_entry(list, struct request, queuelist);
blk_mq_put_driver_tag(rq);
spin_lock(&hctx->lock);
list_splice_init(list, &hctx->dispatch);
spin_unlock(&hctx->lock);
@ -1109,10 +1174,10 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
* it is no longer set that means that it was cleared by another
* thread and hence that a queue rerun is needed.
*
* If TAG_WAITING is set that means that an I/O scheduler has
* been configured and another thread is waiting for a driver
* tag. To guarantee fairness, do not rerun this hardware queue
* but let the other thread grab the driver tag.
* If 'no_tag' is set, that means that we failed getting
* a driver tag with an I/O scheduler attached. If our dispatch
* waitqueue is no longer active, ensure that we run the queue
* AFTER adding our entries back to the list.
*
* If no I/O scheduler has been configured it is possible that
* the hardware queue got stopped and restarted before requests
@ -1124,8 +1189,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
* returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
* and dm-rq.
*/
if (!blk_mq_sched_needs_restart(hctx) &&
!test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state))
if (!blk_mq_sched_needs_restart(hctx) ||
(no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
blk_mq_run_hw_queue(hctx, true);
}
@ -1218,9 +1283,14 @@ void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
}
EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
{
__blk_mq_delay_run_hw_queue(hctx, async, 0);
if (blk_mq_hctx_has_pending(hctx)) {
__blk_mq_delay_run_hw_queue(hctx, async, 0);
return true;
}
return false;
}
EXPORT_SYMBOL(blk_mq_run_hw_queue);
@ -1230,8 +1300,7 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
int i;
queue_for_each_hw_ctx(q, hctx, i) {
if (!blk_mq_hctx_has_pending(hctx) ||
blk_mq_hctx_stopped(hctx))
if (blk_mq_hctx_stopped(hctx))
continue;
blk_mq_run_hw_queue(hctx, async);
@ -1405,7 +1474,7 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
* Should only be used carefully, when the caller knows we want to
* bypass a potential IO scheduler on the target device.
*/
void blk_mq_request_bypass_insert(struct request *rq)
void blk_mq_request_bypass_insert(struct request *rq, bool run_queue)
{
struct blk_mq_ctx *ctx = rq->mq_ctx;
struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
@ -1414,7 +1483,8 @@ void blk_mq_request_bypass_insert(struct request *rq)
list_add_tail(&rq->queuelist, &hctx->dispatch);
spin_unlock(&hctx->lock);
blk_mq_run_hw_queue(hctx, false);
if (run_queue)
blk_mq_run_hw_queue(hctx, false);
}
void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
@ -1501,13 +1571,9 @@ static void blk_mq_bio_to_request(struct request *rq, struct bio *bio)
{
blk_init_request_from_bio(rq, bio);
blk_account_io_start(rq, true);
}
blk_rq_set_rl(rq, blk_get_rl(rq->q, bio));
static inline bool hctx_allow_merges(struct blk_mq_hw_ctx *hctx)
{
return (hctx->flags & BLK_MQ_F_SHOULD_MERGE) &&
!blk_queue_nomerges(hctx->queue);
blk_account_io_start(rq, true);
}
static inline void blk_mq_queue_io(struct blk_mq_hw_ctx *hctx,
@ -1552,6 +1618,11 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
if (!blk_mq_get_driver_tag(rq, NULL, false))
goto insert;
if (!blk_mq_get_dispatch_budget(hctx)) {
blk_mq_put_driver_tag(rq);
goto insert;
}
new_cookie = request_to_qc_t(hctx, rq);
/*
@ -1641,13 +1712,10 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
if (unlikely(is_flush_fua)) {
blk_mq_put_ctx(data.ctx);
blk_mq_bio_to_request(rq, bio);
if (q->elevator) {
blk_mq_sched_insert_request(rq, false, true, true,
true);
} else {
blk_insert_flush(rq);
blk_mq_run_hw_queue(data.hctx, true);
}
/* bypass scheduler for flush rq */
blk_insert_flush(rq);
blk_mq_run_hw_queue(data.hctx, true);
} else if (plug && q->nr_hw_queues == 1) {
struct request *last = NULL;
@ -1990,6 +2058,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
hctx->nr_ctx = 0;
init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake);
INIT_LIST_HEAD(&hctx->dispatch_wait.entry);
if (set->ops->init_hctx &&
set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
goto free_bitmap;
@ -2229,8 +2300,11 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
mutex_lock(&set->tag_list_lock);
/* Check to see if we're transitioning to shared (from 1 to 2 queues). */
if (!list_empty(&set->tag_list) && !(set->flags & BLK_MQ_F_TAG_SHARED)) {
/*
* Check to see if we're transitioning to shared (from 1 to 2 queues).
*/
if (!list_empty(&set->tag_list) &&
!(set->flags & BLK_MQ_F_TAG_SHARED)) {
set->flags |= BLK_MQ_F_TAG_SHARED;
/* update existing queue */
blk_mq_update_tag_set_depth(set, true);
@ -2404,6 +2478,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
spin_lock_init(&q->requeue_lock);
blk_queue_make_request(q, blk_mq_make_request);
if (q->mq_ops->poll)
q->poll_fn = blk_mq_poll;
/*
* Do this after blk_queue_make_request() overrides it...
@ -2460,10 +2536,9 @@ static void blk_mq_queue_reinit(struct request_queue *q)
/*
* redo blk_mq_init_cpu_queues and blk_mq_init_hw_queues. FIXME: maybe
* we should change hctx numa_node according to new topology (this
* involves free and re-allocate memory, worthy doing?)
* we should change hctx numa_node according to the new topology (this
* involves freeing and re-allocating memory, worth doing?)
*/
blk_mq_map_swqueue(q);
blk_mq_sysfs_register(q);
@ -2552,6 +2627,9 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
if (!set->ops->queue_rq)
return -EINVAL;
if (!set->ops->get_budget ^ !set->ops->put_budget)
return -EINVAL;
if (set->queue_depth > BLK_MQ_MAX_DEPTH) {
pr_info("blk-mq: reduced tag depth to %u\n",
BLK_MQ_MAX_DEPTH);
@ -2642,8 +2720,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
* queue depth. This is similar to what the old code would do.
*/
if (!hctx->sched_tags) {
ret = blk_mq_tag_update_depth(hctx, &hctx->tags,
min(nr, set->queue_depth),
ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
false);
} else {
ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
@ -2863,20 +2940,14 @@ static bool __blk_mq_poll(struct blk_mq_hw_ctx *hctx, struct request *rq)
return false;
}
bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie)
static bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie)
{
struct blk_mq_hw_ctx *hctx;
struct blk_plug *plug;
struct request *rq;
if (!q->mq_ops || !q->mq_ops->poll || !blk_qc_t_valid(cookie) ||
!test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
if (!test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
return false;
plug = current->plug;
if (plug)
blk_flush_plug_list(plug, false);
hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
if (!blk_qc_t_is_internal(cookie))
rq = blk_mq_tag_to_rq(hctx->tags, blk_qc_t_to_tag(cookie));
@ -2894,10 +2965,15 @@ bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie)
return __blk_mq_poll(hctx, rq);
}
EXPORT_SYMBOL_GPL(blk_mq_poll);
static int __init blk_mq_init(void)
{
/*
* See comment in block/blk.h rq_atomic_flags enum
*/
BUILD_BUG_ON((REQ_ATOM_STARTED / BITS_PER_BYTE) !=
(REQ_ATOM_COMPLETE / BITS_PER_BYTE));
cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
blk_mq_hctx_notify_dead);
return 0;


@ -3,6 +3,7 @@
#define INT_BLK_MQ_H
#include "blk-stat.h"
#include "blk-mq-tag.h"
struct blk_mq_tag_set;
@ -26,16 +27,16 @@ struct blk_mq_ctx {
struct kobject kobj;
} ____cacheline_aligned_in_smp;
void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
void blk_mq_freeze_queue(struct request_queue *q);
void blk_mq_free_queue(struct request_queue *q);
int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
void blk_mq_wake_waiters(struct request_queue *q);
bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *);
bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool);
void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
bool wait);
struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
struct blk_mq_ctx *start);
/*
* Internal helpers for allocating/freeing the request map
@ -55,7 +56,7 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
*/
void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
bool at_head);
void blk_mq_request_bypass_insert(struct request *rq);
void blk_mq_request_bypass_insert(struct request *rq, bool run_queue);
void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
struct list_head *list);
@ -109,7 +110,7 @@ static inline void blk_mq_put_ctx(struct blk_mq_ctx *ctx)
struct blk_mq_alloc_data {
/* input parameter */
struct request_queue *q;
unsigned int flags;
blk_mq_req_flags_t flags;
unsigned int shallow_depth;
/* input & output parameter */
@ -138,4 +139,53 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part,
unsigned int inflight[2]);
static inline void blk_mq_put_dispatch_budget(struct blk_mq_hw_ctx *hctx)
{
struct request_queue *q = hctx->queue;
if (q->mq_ops->put_budget)
q->mq_ops->put_budget(hctx);
}
static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
{
struct request_queue *q = hctx->queue;
if (q->mq_ops->get_budget)
return q->mq_ops->get_budget(hctx);
return true;
}
static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
struct request *rq)
{
blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
rq->tag = -1;
if (rq->rq_flags & RQF_MQ_INFLIGHT) {
rq->rq_flags &= ~RQF_MQ_INFLIGHT;
atomic_dec(&hctx->nr_active);
}
}
static inline void blk_mq_put_driver_tag_hctx(struct blk_mq_hw_ctx *hctx,
struct request *rq)
{
if (rq->tag == -1 || rq->internal_tag == -1)
return;
__blk_mq_put_driver_tag(hctx, rq);
}
static inline void blk_mq_put_driver_tag(struct request *rq)
{
struct blk_mq_hw_ctx *hctx;
if (rq->tag == -1 || rq->internal_tag == -1)
return;
hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu);
__blk_mq_put_driver_tag(hctx, rq);
}
#endif
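
The budget helpers above treat `get_budget` and `put_budget` as an optional pair, and `blk_mq_alloc_tag_set()` rejects a tag set that provides only one of them. A minimal userspace sketch of that pattern follows. The `demo_*` names and the one-slot toy driver are hypothetical and exist only for illustration; they are not part of the kernel API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a driver ops table; only the budget hooks. */
struct demo_ops {
	bool (*get_budget)(void *hctx);
	void (*put_budget)(void *hctx);
};

/* "Both or neither": true when exactly one of the two hooks is set. */
static bool demo_budget_ops_invalid(const struct demo_ops *ops)
{
	return !ops->get_budget ^ !ops->put_budget;
}

/* Without a get_budget hook, dispatch is always allowed. */
static bool demo_get_budget(const struct demo_ops *ops, void *hctx)
{
	if (ops->get_budget)
		return ops->get_budget(hctx);
	return true;
}

/* Without a put_budget hook, there is nothing to give back. */
static void demo_put_budget(const struct demo_ops *ops, void *hctx)
{
	if (ops->put_budget)
		ops->put_budget(hctx);
}

/* Toy driver: *hctx is an int counting the free budget slots. */
static bool demo_slot_get(void *hctx)
{
	int *avail = hctx;

	if (*avail <= 0)
		return false;
	(*avail)--;
	return true;
}

static void demo_slot_put(void *hctx)
{
	int *avail = hctx;

	(*avail)++;
}
```

A caller that obtains budget and then fails a later step (for example, getting a driver tag) is expected to call the put helper, which is the shape the dispatch paths in this diff follow.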


@ -157,7 +157,7 @@ EXPORT_SYMBOL(blk_set_stacking_limits);
* Caveat:
* The driver that does this *must* be able to deal appropriately
* with buffers in "highmemory". This can be accomplished by either calling
* __bio_kmap_atomic() to get a temporary kernel mapping, or by calling
* kmap_atomic() to get a temporary kernel mapping, or by calling
* blk_queue_bounce() to create a buffer in normal memory.
**/
void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)


@ -11,8 +11,6 @@
#include "blk-mq.h"
#include "blk.h"
#define BLK_RQ_STAT_BATCH 64
struct blk_queue_stats {
struct list_head callbacks;
spinlock_t lock;
@ -23,45 +21,21 @@ static void blk_stat_init(struct blk_rq_stat *stat)
{
stat->min = -1ULL;
stat->max = stat->nr_samples = stat->mean = 0;
stat->batch = stat->nr_batch = 0;
}
static void blk_stat_flush_batch(struct blk_rq_stat *stat)
{
const s32 nr_batch = READ_ONCE(stat->nr_batch);
const s32 nr_samples = READ_ONCE(stat->nr_samples);
if (!nr_batch)
return;
if (!nr_samples)
stat->mean = div64_s64(stat->batch, nr_batch);
else {
stat->mean = div64_s64((stat->mean * nr_samples) +
stat->batch,
nr_batch + nr_samples);
}
stat->nr_samples += nr_batch;
stat->nr_batch = stat->batch = 0;
stat->batch = 0;
}
/* src is a per-cpu stat, mean isn't initialized */
static void blk_stat_sum(struct blk_rq_stat *dst, struct blk_rq_stat *src)
{
blk_stat_flush_batch(src);
if (!src->nr_samples)
return;
dst->min = min(dst->min, src->min);
dst->max = max(dst->max, src->max);
if (!dst->nr_samples)
dst->mean = src->mean;
else {
dst->mean = div64_s64((src->mean * src->nr_samples) +
(dst->mean * dst->nr_samples),
dst->nr_samples + src->nr_samples);
}
dst->mean = div_u64(src->batch + dst->mean * dst->nr_samples,
dst->nr_samples + src->nr_samples);
dst->nr_samples += src->nr_samples;
}
@ -69,13 +43,8 @@ static void __blk_stat_add(struct blk_rq_stat *stat, u64 value)
{
stat->min = min(stat->min, value);
stat->max = max(stat->max, value);
if (stat->batch + value < stat->batch ||
stat->nr_batch + 1 == BLK_RQ_STAT_BATCH)
blk_stat_flush_batch(stat);
stat->batch += value;
stat->nr_batch++;
stat->nr_samples++;
}
void blk_stat_add(struct request *rq)
@ -84,7 +53,7 @@ void blk_stat_add(struct request *rq)
struct blk_stat_callback *cb;
struct blk_rq_stat *stat;
int bucket;
s64 now, value;
u64 now, value;
now = __blk_stat_time(ktime_to_ns(ktime_get()));
if (now < blk_stat_time(&rq->issue_stat))
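
The blk-stat hunks above drop the separate batch counter: each per-cpu stat now accumulates a raw sum in `batch`, and `blk_stat_sum()` folds that sum into the destination mean in a single division. The sketch below restates that arithmetic in userspace C. The struct layout and the `demo_*` names are assumptions made for illustration, and the min/max tracking is left out.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for struct blk_rq_stat. */
struct demo_stat {
	uint64_t mean;       /* only meaningful on the aggregate */
	uint64_t batch;      /* running sum of samples on a per-cpu stat */
	uint32_t nr_samples;
};

/* Per-cpu side: just accumulate, no division on the hot path. */
static void demo_stat_add(struct demo_stat *s, uint64_t value)
{
	s->batch += value;
	s->nr_samples++;
}

/*
 * Aggregate side: weighted merge of the existing mean with the raw
 * sum from src, as in the new blk_stat_sum() body.
 */
static void demo_stat_sum(struct demo_stat *dst, const struct demo_stat *src)
{
	if (!src->nr_samples)
		return;

	dst->mean = (src->batch + dst->mean * dst->nr_samples) /
		    (dst->nr_samples + src->nr_samples);
	dst->nr_samples += src->nr_samples;
}
```

Merging a source holding 10, 20 and 30 into an empty aggregate gives a mean of 20; merging a second source holding 40 then gives (40 + 20 * 3) / 4 = 25.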


@ -2113,8 +2113,12 @@ static inline void throtl_update_latency_buckets(struct throtl_data *td)
static void blk_throtl_assoc_bio(struct throtl_grp *tg, struct bio *bio)
{
#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
if (bio->bi_css)
if (bio->bi_css) {
if (bio->bi_cg_private)
blkg_put(tg_to_blkg(bio->bi_cg_private));
bio->bi_cg_private = tg;
blkg_get(tg_to_blkg(tg));
}
blk_stat_set_issue(&bio->bi_issue_stat, bio_sectors(bio));
#endif
}
@ -2284,8 +2288,10 @@ void blk_throtl_bio_endio(struct bio *bio)
start_time = blk_stat_time(&bio->bi_issue_stat) >> 10;
finish_time = __blk_stat_time(finish_time_ns) >> 10;
if (!start_time || finish_time <= start_time)
if (!start_time || finish_time <= start_time) {
blkg_put(tg_to_blkg(tg));
return;
}
lat = finish_time - start_time;
/* this is only for bio based driver */
@ -2315,6 +2321,8 @@ void blk_throtl_bio_endio(struct bio *bio)
tg->bio_cnt /= 2;
tg->bad_bio_cnt /= 2;
}
blkg_put(tg_to_blkg(tg));
}
#endif


@ -134,8 +134,6 @@ void blk_timeout_work(struct work_struct *work)
struct request *rq, *tmp;
int next_set = 0;
if (blk_queue_enter(q, true))
return;
spin_lock_irqsave(q->queue_lock, flags);
list_for_each_entry_safe(rq, tmp, &q->timeout_list, timeout_list)
@ -145,7 +143,6 @@ void blk_timeout_work(struct work_struct *work)
mod_timer(&q->timeout, round_jiffies_up(next));
spin_unlock_irqrestore(q->queue_lock, flags);
blk_queue_exit(q);
}
/**
@ -211,7 +208,7 @@ void blk_add_timer(struct request *req)
if (!req->timeout)
req->timeout = q->rq_timeout;
req->deadline = jiffies + req->timeout;
WRITE_ONCE(req->deadline, jiffies + req->timeout);
/*
* Only the non-mq case needs to add the request to a protected list.


@ -654,7 +654,7 @@ void wbt_set_write_cache(struct rq_wb *rwb, bool write_cache_on)
}
/*
* Disable wbt, if enabled by default. Only called from CFQ.
* Disable wbt, if enabled by default.
*/
void wbt_disable_default(struct request_queue *q)
{


@ -123,8 +123,15 @@ void blk_account_io_done(struct request *req);
* Internal atomic flags for request handling
*/
enum rq_atomic_flags {
/*
* Keep these two bits first - not because we depend on the
* value of them, but we do depend on them being in the same
* byte of storage to ensure ordering on writes. Keeping them
* first will achieve that nicely.
*/
REQ_ATOM_COMPLETE = 0,
REQ_ATOM_STARTED,
REQ_ATOM_POLL_SLEPT,
};
@ -149,45 +156,6 @@ static inline void blk_clear_rq_complete(struct request *rq)
void blk_insert_flush(struct request *rq);
static inline struct request *__elv_next_request(struct request_queue *q)
{
struct request *rq;
struct blk_flush_queue *fq = blk_get_flush_queue(q, NULL);
WARN_ON_ONCE(q->mq_ops);
while (1) {
if (!list_empty(&q->queue_head)) {
rq = list_entry_rq(q->queue_head.next);
return rq;
}
/*
* Flush request is running and flush request isn't queueable
* in the drive, we can hold the queue till flush request is
* finished. Even we don't do this, driver can't dispatch next
* requests and will requeue them. And this can improve
* throughput too. For example, we have request flush1, write1,
* flush 2. flush1 is dispatched, then queue is hold, write1
* isn't inserted to queue. After flush1 is finished, flush2
* will be dispatched. Since disk cache is already clean,
* flush2 will be finished very soon, so looks like flush2 is
* folded to flush1.
* Since the queue is hold, a flag is set to indicate the queue
* should be restarted later. Please see flush_end_io() for
* details.
*/
if (fq->flush_pending_idx != fq->flush_running_idx &&
!queue_flush_queueable(q)) {
fq->flush_queue_delayed = 1;
return NULL;
}
if (unlikely(blk_queue_bypass(q)) ||
!q->elevator->type->ops.sq.elevator_dispatch_fn(q, 0))
return NULL;
}
}
static inline void elv_activate_rq(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;


@ -137,7 +137,7 @@ static inline struct hlist_head *bsg_dev_idx_hash(int index)
static int blk_fill_sgv4_hdr_rq(struct request_queue *q, struct request *rq,
struct sg_io_v4 *hdr, struct bsg_device *bd,
fmode_t has_write_perm)
fmode_t mode)
{
struct scsi_request *req = scsi_req(rq);
@ -152,7 +152,7 @@ static int blk_fill_sgv4_hdr_rq(struct request_queue *q, struct request *rq,
return -EFAULT;
if (hdr->subprotocol == BSG_SUB_PROTOCOL_SCSI_CMD) {
if (blk_verify_command(req->cmd, has_write_perm))
if (blk_verify_command(req->cmd, mode))
return -EPERM;
} else if (!capable(CAP_SYS_RAWIO))
return -EPERM;
@ -206,7 +206,7 @@ bsg_validate_sgv4_hdr(struct sg_io_v4 *hdr, int *op)
* map sg_io_v4 to a request.
*/
static struct request *
bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr, fmode_t has_write_perm)
bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr, fmode_t mode)
{
struct request_queue *q = bd->queue;
struct request *rq, *next_rq = NULL;
@ -237,7 +237,7 @@ bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr, fmode_t has_write_perm)
if (IS_ERR(rq))
return rq;
ret = blk_fill_sgv4_hdr_rq(q, rq, hdr, bd, has_write_perm);
ret = blk_fill_sgv4_hdr_rq(q, rq, hdr, bd, mode);
if (ret)
goto out;
@ -587,8 +587,7 @@ bsg_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
}
static int __bsg_write(struct bsg_device *bd, const char __user *buf,
size_t count, ssize_t *bytes_written,
fmode_t has_write_perm)
size_t count, ssize_t *bytes_written, fmode_t mode)
{
struct bsg_command *bc;
struct request *rq;
@ -619,7 +618,7 @@ static int __bsg_write(struct bsg_device *bd, const char __user *buf,
/*
* get a request, fill in the blanks, and add to request queue
*/
rq = bsg_map_hdr(bd, &bc->hdr, has_write_perm);
rq = bsg_map_hdr(bd, &bc->hdr, mode);
if (IS_ERR(rq)) {
ret = PTR_ERR(rq);
rq = NULL;
@ -655,8 +654,7 @@ bsg_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos)
bsg_set_block(bd, file);
bytes_written = 0;
ret = __bsg_write(bd, buf, count, &bytes_written,
file->f_mode & FMODE_WRITE);
ret = __bsg_write(bd, buf, count, &bytes_written, file->f_mode);
*ppos = bytes_written;
@ -915,7 +913,7 @@ static long bsg_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
if (copy_from_user(&hdr, uarg, sizeof(hdr)))
return -EFAULT;
rq = bsg_map_hdr(bd, &hdr, file->f_mode & FMODE_WRITE);
rq = bsg_map_hdr(bd, &hdr, file->f_mode);
if (IS_ERR(rq))
return PTR_ERR(rq);


@ -83,12 +83,25 @@ bool elv_bio_merge_ok(struct request *rq, struct bio *bio)
}
EXPORT_SYMBOL(elv_bio_merge_ok);
static struct elevator_type *elevator_find(const char *name)
static bool elevator_match(const struct elevator_type *e, const char *name)
{
if (!strcmp(e->elevator_name, name))
return true;
if (e->elevator_alias && !strcmp(e->elevator_alias, name))
return true;
return false;
}
/*
* Return scheduler with name 'name' and with matching 'mq capability
*/
static struct elevator_type *elevator_find(const char *name, bool mq)
{
struct elevator_type *e;
list_for_each_entry(e, &elv_list, list) {
if (!strcmp(e->elevator_name, name))
if (elevator_match(e, name) && (mq == e->uses_mq))
return e;
}
@ -100,25 +113,25 @@ static void elevator_put(struct elevator_type *e)
module_put(e->elevator_owner);
}
static struct elevator_type *elevator_get(const char *name, bool try_loading)
static struct elevator_type *elevator_get(struct request_queue *q,
const char *name, bool try_loading)
{
struct elevator_type *e;
spin_lock(&elv_list_lock);
e = elevator_find(name);
e = elevator_find(name, q->mq_ops != NULL);
if (!e && try_loading) {
spin_unlock(&elv_list_lock);
request_module("%s-iosched", name);
spin_lock(&elv_list_lock);
e = elevator_find(name);
e = elevator_find(name, q->mq_ops != NULL);
}
if (e && !try_module_get(e->elevator_owner))
e = NULL;
spin_unlock(&elv_list_lock);
return e;
}
@ -144,8 +157,12 @@ void __init load_default_elevator_module(void)
if (!chosen_elevator[0])
return;
/*
* Boot parameter is deprecated, we haven't supported that for MQ.
* Only look for non-mq schedulers from here.
*/
spin_lock(&elv_list_lock);
e = elevator_find(chosen_elevator);
e = elevator_find(chosen_elevator, false);
spin_unlock(&elv_list_lock);
if (!e)
@ -202,7 +219,7 @@ int elevator_init(struct request_queue *q, char *name)
q->boundary_rq = NULL;
if (name) {
e = elevator_get(name, true);
e = elevator_get(q, name, true);
if (!e)
return -EINVAL;
}
@ -214,7 +231,7 @@ int elevator_init(struct request_queue *q, char *name)
* allowed from async.
*/
if (!e && !q->mq_ops && *chosen_elevator) {
e = elevator_get(chosen_elevator, false);
e = elevator_get(q, chosen_elevator, false);
if (!e)
printk(KERN_ERR "I/O scheduler %s not found\n",
chosen_elevator);
@ -229,17 +246,17 @@ int elevator_init(struct request_queue *q, char *name)
*/
if (q->mq_ops) {
if (q->nr_hw_queues == 1)
e = elevator_get("mq-deadline", false);
e = elevator_get(q, "mq-deadline", false);
if (!e)
return 0;
} else
e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
e = elevator_get(q, CONFIG_DEFAULT_IOSCHED, false);
if (!e) {
printk(KERN_ERR
"Default I/O scheduler not found. " \
"Using noop.\n");
e = elevator_get("noop", false);
e = elevator_get(q, "noop", false);
}
}
@ -905,7 +922,7 @@ int elv_register(struct elevator_type *e)
/* register, don't allow duplicate names */
spin_lock(&elv_list_lock);
if (elevator_find(e->elevator_name)) {
if (elevator_find(e->elevator_name, e->uses_mq)) {
spin_unlock(&elv_list_lock);
if (e->icq_cache)
kmem_cache_destroy(e->icq_cache);
@ -915,9 +932,9 @@ int elv_register(struct elevator_type *e)
spin_unlock(&elv_list_lock);
/* print pretty message */
if (!strcmp(e->elevator_name, chosen_elevator) ||
if (elevator_match(e, chosen_elevator) ||
(!*chosen_elevator &&
!strcmp(e->elevator_name, CONFIG_DEFAULT_IOSCHED)))
elevator_match(e, CONFIG_DEFAULT_IOSCHED)))
def = " (default)";
printk(KERN_INFO "io scheduler %s registered%s\n", e->elevator_name,
@ -1066,25 +1083,15 @@ static int __elevator_change(struct request_queue *q, const char *name)
return elevator_switch(q, NULL);
strlcpy(elevator_name, name, sizeof(elevator_name));
e = elevator_get(strstrip(elevator_name), true);
e = elevator_get(q, strstrip(elevator_name), true);
if (!e)
return -EINVAL;
if (q->elevator &&
!strcmp(elevator_name, q->elevator->type->elevator_name)) {
if (q->elevator && elevator_match(q->elevator->type, elevator_name)) {
elevator_put(e);
return 0;
}
if (!e->uses_mq && q->mq_ops) {
elevator_put(e);
return -EINVAL;
}
if (e->uses_mq && !q->mq_ops) {
elevator_put(e);
return -EINVAL;
}
return elevator_switch(q, e);
}
@ -1116,9 +1123,10 @@ ssize_t elv_iosched_show(struct request_queue *q, char *name)
struct elevator_queue *e = q->elevator;
struct elevator_type *elv = NULL;
struct elevator_type *__e;
bool uses_mq = q->mq_ops != NULL;
int len = 0;
if (!blk_queue_stackable(q))
if (!queue_is_rq_based(q))
return sprintf(name, "none\n");
if (!q->elevator)
@ -1128,7 +1136,8 @@ ssize_t elv_iosched_show(struct request_queue *q, char *name)
spin_lock(&elv_list_lock);
list_for_each_entry(__e, &elv_list, list) {
if (elv && !strcmp(elv->elevator_name, __e->elevator_name)) {
if (elv && elevator_match(elv, __e->elevator_name) &&
(__e->uses_mq == uses_mq)) {
len += sprintf(name+len, "[%s] ", elv->elevator_name);
continue;
}
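
The elevator changes above key the scheduler lookup on a {name, mq} pair and let a scheduler carry an alias, which is how 'deadline' can resolve to mq-deadline on a blk-mq queue. A small userspace sketch of that lookup follows. The table contents and the `demo_*` names are made up for illustration; the real code walks `elv_list` under `elv_list_lock`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the fields elevator_match() looks at. */
struct demo_elevator {
	const char *name;
	const char *alias;   /* may be NULL */
	bool uses_mq;
};

/* A scheduler matches on its primary name or, if set, its alias. */
static bool demo_match(const struct demo_elevator *e, const char *name)
{
	if (!strcmp(e->name, name))
		return true;
	if (e->alias && !strcmp(e->alias, name))
		return true;
	return false;
}

/* Return the first entry whose name matches and whose mq flag agrees. */
static const struct demo_elevator *
demo_find(const struct demo_elevator *tbl, size_t n, const char *name, bool mq)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (demo_match(&tbl[i], name) && tbl[i].uses_mq == mq)
			return &tbl[i];
	}
	return NULL;
}

/* Illustrative registry: a legacy and an mq scheduler sharing a name. */
static const struct demo_elevator demo_table[] = {
	{ "deadline",    NULL,       false },
	{ "mq-deadline", "deadline", true  },
	{ "kyber",       NULL,       true  },
};
```

Because the mq flag is part of the key, the explicit `uses_mq`/`mq_ops` mismatch checks in `__elevator_change()` become unnecessary, which matches their removal in the hunk above.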


@ -588,6 +588,11 @@ static void register_disk(struct device *parent, struct gendisk *disk)
disk->part0.holder_dir = kobject_create_and_add("holders", &ddev->kobj);
disk->slave_dir = kobject_create_and_add("slaves", &ddev->kobj);
if (disk->flags & GENHD_FL_HIDDEN) {
dev_set_uevent_suppress(ddev, 0);
return;
}
/* No minors to use for partitions */
if (!disk_part_scan_enabled(disk))
goto exit;
@ -616,6 +621,11 @@ exit:
while ((part = disk_part_iter_next(&piter)))
kobject_uevent(&part_to_dev(part)->kobj, KOBJ_ADD);
disk_part_iter_exit(&piter);
err = sysfs_create_link(&ddev->kobj,
&disk->queue->backing_dev_info->dev->kobj,
"bdi");
WARN_ON(err);
}
/**
@ -630,7 +640,6 @@ exit:
*/
void device_add_disk(struct device *parent, struct gendisk *disk)
{
struct backing_dev_info *bdi;
dev_t devt;
int retval;
@ -639,7 +648,8 @@ void device_add_disk(struct device *parent, struct gendisk *disk)
* parameters make sense.
*/
WARN_ON(disk->minors && !(disk->major || disk->first_minor));
WARN_ON(!disk->minors && !(disk->flags & GENHD_FL_EXT_DEVT));
WARN_ON(!disk->minors &&
!(disk->flags & (GENHD_FL_EXT_DEVT | GENHD_FL_HIDDEN)));
disk->flags |= GENHD_FL_UP;
@ -648,22 +658,26 @@ void device_add_disk(struct device *parent, struct gendisk *disk)
WARN_ON(1);
return;
}
disk_to_dev(disk)->devt = devt;
/* ->major and ->first_minor aren't supposed to be
* dereferenced from here on, but set them just in case.
*/
disk->major = MAJOR(devt);
disk->first_minor = MINOR(devt);
disk_alloc_events(disk);
/* Register BDI before referencing it from bdev */
bdi = disk->queue->backing_dev_info;
bdi_register_owner(bdi, disk_to_dev(disk));
blk_register_region(disk_devt(disk), disk->minors, NULL,
exact_match, exact_lock, disk);
if (disk->flags & GENHD_FL_HIDDEN) {
/*
* Don't let hidden disks show up in /proc/partitions,
* and don't bother scanning for partitions either.
*/
disk->flags |= GENHD_FL_SUPPRESS_PARTITION_INFO;
disk->flags |= GENHD_FL_NO_PART_SCAN;
} else {
/* Register BDI before referencing it from bdev */
disk_to_dev(disk)->devt = devt;
bdi_register_owner(disk->queue->backing_dev_info,
disk_to_dev(disk));
blk_register_region(disk_devt(disk), disk->minors, NULL,
exact_match, exact_lock, disk);
}
register_disk(parent, disk);
blk_register_queue(disk);
@ -673,10 +687,6 @@ void device_add_disk(struct device *parent, struct gendisk *disk)
*/
WARN_ON_ONCE(!blk_get_queue(disk->queue));
retval = sysfs_create_link(&disk_to_dev(disk)->kobj, &bdi->dev->kobj,
"bdi");
WARN_ON(retval);
disk_add_events(disk);
blk_integrity_add(disk);
}
@ -705,7 +715,8 @@ void del_gendisk(struct gendisk *disk)
set_capacity(disk, 0);
disk->flags &= ~GENHD_FL_UP;
sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
if (!(disk->flags & GENHD_FL_HIDDEN))
sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
if (disk->queue) {
/*
* Unregister bdi before releasing device numbers (as they can
@ -716,13 +727,15 @@ void del_gendisk(struct gendisk *disk)
} else {
WARN_ON(1);
}
blk_unregister_region(disk_devt(disk), disk->minors);
part_stat_set_all(&disk->part0, 0);
disk->part0.stamp = 0;
if (!(disk->flags & GENHD_FL_HIDDEN))
blk_unregister_region(disk_devt(disk), disk->minors);
kobject_put(disk->part0.holder_dir);
kobject_put(disk->slave_dir);
part_stat_set_all(&disk->part0, 0);
disk->part0.stamp = 0;
if (!sysfs_deprecated)
sysfs_remove_link(block_depr, dev_name(disk_to_dev(disk)));
pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);
@ -785,6 +798,10 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
spin_unlock_bh(&ext_devt_lock);
}
if (disk && unlikely(disk->flags & GENHD_FL_HIDDEN)) {
put_disk(disk);
disk = NULL;
}
return disk;
}
EXPORT_SYMBOL(get_gendisk);
@ -1028,6 +1045,15 @@ static ssize_t disk_removable_show(struct device *dev,
(disk->flags & GENHD_FL_REMOVABLE ? 1 : 0));
}
static ssize_t disk_hidden_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct gendisk *disk = dev_to_disk(dev);
return sprintf(buf, "%d\n",
(disk->flags & GENHD_FL_HIDDEN ? 1 : 0));
}
static ssize_t disk_ro_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@ -1065,6 +1091,7 @@ static ssize_t disk_discard_alignment_show(struct device *dev,
static DEVICE_ATTR(range, S_IRUGO, disk_range_show, NULL);
static DEVICE_ATTR(ext_range, S_IRUGO, disk_ext_range_show, NULL);
static DEVICE_ATTR(removable, S_IRUGO, disk_removable_show, NULL);
static DEVICE_ATTR(hidden, S_IRUGO, disk_hidden_show, NULL);
static DEVICE_ATTR(ro, S_IRUGO, disk_ro_show, NULL);
static DEVICE_ATTR(size, S_IRUGO, part_size_show, NULL);
static DEVICE_ATTR(alignment_offset, S_IRUGO, disk_alignment_offset_show, NULL);
@ -1089,6 +1116,7 @@ static struct attribute *disk_attrs[] = {
&dev_attr_range.attr,
&dev_attr_ext_range.attr,
&dev_attr_removable.attr,
&dev_attr_hidden.attr,
&dev_attr_ro.attr,
&dev_attr_size.attr,
&dev_attr_alignment_offset.attr,


@ -202,10 +202,16 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
{
uint64_t range[2];
uint64_t start, len;
struct request_queue *q = bdev_get_queue(bdev);
struct address_space *mapping = bdev->bd_inode->i_mapping;
if (!(mode & FMODE_WRITE))
return -EBADF;
if (!blk_queue_discard(q))
return -EOPNOTSUPP;
if (copy_from_user(range, (void __user *)arg, sizeof(range)))
return -EFAULT;
@ -216,12 +222,12 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
return -EINVAL;
if (len & 511)
return -EINVAL;
start >>= 9;
len >>= 9;
if (start + len > (i_size_read(bdev->bd_inode) >> 9))
if (start + len > i_size_read(bdev->bd_inode))
return -EINVAL;
return blkdev_issue_discard(bdev, start, len, GFP_KERNEL, flags);
truncate_inode_pages_range(mapping, start, start + len);
return blkdev_issue_discard(bdev, start >> 9, len >> 9,
GFP_KERNEL, flags);
}
static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
@ -437,11 +443,12 @@ static int blkdev_roset(struct block_device *bdev, fmode_t mode,
{
int ret, n;
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
ret = __blkdev_driver_ioctl(bdev, mode, cmd, arg);
if (!is_unrecognized_ioctl(ret))
return ret;
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
if (get_user(n, (int __user *)arg))
return -EFAULT;
set_device_ro(bdev, n);
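
The blk_ioctl_discard() hunk above moves the range checks from sector units to byte units and shifts by 9 only when the discard is issued. Below is a minimal userspace sketch of those checks, assuming a 512-byte sector. The names discard_range_check() and bytes_to_sectors() are hypothetical and are not kernel functions. Like the hunk, the sketch does not guard against start + len wrapping around.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SECTOR_SHIFT 9
#define SECTOR_MASK  ((1u << SECTOR_SHIFT) - 1)	/* 511, as in the hunk */

/*
 * Hypothetical stand-in for the validation in blk_ioctl_discard():
 * start, len and dev_bytes are all byte counts.
 */
int discard_range_check(uint64_t start, uint64_t len, uint64_t dev_bytes)
{
	if (start & SECTOR_MASK)	/* start must be sector aligned */
		return -EINVAL;
	if (len & SECTOR_MASK)		/* length must be sector aligned */
		return -EINVAL;
	if (start + len > dev_bytes)	/* compare bytes with bytes */
		return -EINVAL;
	return 0;
}

/* Conversion applied only at the point where the discard is issued. */
uint64_t bytes_to_sectors(uint64_t bytes)
{
	return bytes >> SECTOR_SHIFT;
}
```

Keeping start and len in bytes lets the same values be used for a byte-based call such as truncate_inode_pages_range() before the sector-based discard is issued.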

@ -541,9 +541,17 @@ static int kyber_get_domain_token(struct kyber_queue_data *kqd,
/*
* Try again in case a token was freed before we got on the wait
* queue.
* queue. The waker may have already removed the entry from the
* wait queue, but list_del_init() is okay with that.
*/
nr = __sbitmap_queue_get(domain_tokens);
if (nr >= 0) {
unsigned long flags;
spin_lock_irqsave(&ws->wait.lock, flags);
list_del_init(&wait->entry);
spin_unlock_irqrestore(&ws->wait.lock, flags);
}
}
return nr;
}
@ -641,7 +649,7 @@ static bool kyber_has_work(struct blk_mq_hw_ctx *hctx)
if (!list_empty_careful(&khd->rqs[i]))
return true;
}
return false;
return sbitmap_any_bit_set(&hctx->ctx_map);
}
#define KYBER_LAT_SHOW_STORE(op) \

@ -657,6 +657,7 @@ static struct elevator_type mq_deadline = {
#endif
.elevator_attrs = deadline_attrs,
.elevator_name = "mq-deadline",
.elevator_alias = "deadline",
.elevator_owner = THIS_MODULE,
};
MODULE_ALIAS("mq-deadline-iosched");
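
The .elevator_alias field added above lets 'deadline' resolve to mq-deadline on blk-mq queues, matching the {mq,name} lookup described in the pull message. The following is a rough sketch of such a lookup, not the kernel's implementation. The struct elv_entry, the table contents and elv_lookup() are all hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified scheduler descriptor. */
struct elv_entry {
	const char *name;
	const char *alias;	/* may be NULL */
	bool uses_mq;
};

static const struct elv_entry elv_table[] = {
	{ "deadline",    NULL,       false },
	{ "mq-deadline", "deadline", true  },
	{ "kyber",       NULL,       true  },
};

/* Match on the (mq, name) pair; the alias is accepted as a name. */
const struct elv_entry *elv_lookup(const char *name, bool mq)
{
	size_t i;

	for (i = 0; i < sizeof(elv_table) / sizeof(elv_table[0]); i++) {
		const struct elv_entry *e = &elv_table[i];

		if (e->uses_mq != mq)
			continue;
		if (!strcmp(e->name, name))
			return e;
		if (e->alias && !strcmp(e->alias, name))
			return e;
	}
	return NULL;
}
```

In this sketch, elv_lookup("deadline", true) returns the mq-deadline entry and elv_lookup("deadline", false) returns the legacy one, because the queue type is part of the key.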

@ -207,7 +207,7 @@ static void blk_set_cmd_filter_defaults(struct blk_cmd_filter *filter)
__set_bit(GPCMD_SET_READ_AHEAD, filter->write_ok);
}
int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm)
int blk_verify_command(unsigned char *cmd, fmode_t mode)
{
struct blk_cmd_filter *filter = &blk_default_cmd_filter;
@ -220,7 +220,7 @@ int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm)
return 0;
/* Write-safe commands require a writable open */
if (test_bit(cmd[0], filter->write_ok) && has_write_perm)
if (test_bit(cmd[0], filter->write_ok) && (mode & FMODE_WRITE))
return 0;
return -EPERM;
@ -234,7 +234,7 @@ static int blk_fill_sghdr_rq(struct request_queue *q, struct request *rq,
if (copy_from_user(req->cmd, hdr->cmdp, hdr->cmd_len))
return -EFAULT;
if (blk_verify_command(req->cmd, mode & FMODE_WRITE))
if (blk_verify_command(req->cmd, mode))
return -EPERM;
/*
@ -469,7 +469,7 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk *disk, fmode_t mode,
if (in_len && copy_from_user(buffer, sic->data + cmdlen, in_len))
goto error;
err = blk_verify_command(req->cmd, mode & FMODE_WRITE);
err = blk_verify_command(req->cmd, mode);
if (err)
goto error;

@ -68,9 +68,13 @@ config AMIGA_Z2RAM
To compile this driver as a module, choose M here: the
module will be called z2ram.
config CDROM
tristate
config GDROM
tristate "SEGA Dreamcast GD-ROM drive"
depends on SH_DREAMCAST
select CDROM
select BLK_SCSI_REQUEST # only for the generic cdrom code
help
A standard SEGA Dreamcast comes with a modified CD ROM drive called a
@ -348,6 +352,7 @@ config BLK_DEV_RAM_DAX
config CDROM_PKTCDVD
tristate "Packet writing on CD/DVD media (DEPRECATED)"
depends on !UML
select CDROM
select BLK_SCSI_REQUEST
help
Note: This driver is deprecated and will be removed from the

@ -60,7 +60,6 @@ struct brd_device {
/*
* Look up and return a brd's page for a given sector.
*/
static DEFINE_MUTEX(brd_mutex);
static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
{
pgoff_t idx;

@ -43,7 +43,6 @@ cryptoloop_init(struct loop_device *lo, const struct loop_info64 *info)
int cipher_len;
int mode_len;
char cms[LO_NAME_SIZE]; /* cipher-mode string */
char *cipher;
char *mode;
char *cmsp = cms; /* c-m string pointer */
struct crypto_skcipher *tfm;
@ -56,7 +55,6 @@ cryptoloop_init(struct loop_device *lo, const struct loop_info64 *info)
strncpy(cms, info->lo_crypt_name, LO_NAME_SIZE);
cms[LO_NAME_SIZE - 1] = 0;
cipher = cmsp;
cipher_len = strcspn(cmsp, "-");
mode = cmsp + cipher_len;

@ -476,6 +476,8 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
{
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
if (cmd->css)
css_put(cmd->css);
cmd->ret = ret;
lo_rw_aio_do_completion(cmd);
}
@ -535,6 +537,8 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
cmd->iocb.ki_filp = file;
cmd->iocb.ki_complete = lo_rw_aio_complete;
cmd->iocb.ki_flags = IOCB_DIRECT;
if (cmd->css)
kthread_associate_blkcg(cmd->css);
if (rw == WRITE)
ret = call_write_iter(file, &cmd->iocb, &iter);
@ -542,6 +546,7 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
ret = call_read_iter(file, &cmd->iocb, &iter);
lo_rw_aio_do_completion(cmd);
kthread_associate_blkcg(NULL);
if (ret != -EIOCBQUEUED)
cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
@ -1686,6 +1691,14 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
break;
}
/* always use the first bio's css */
#ifdef CONFIG_BLK_CGROUP
if (cmd->use_aio && cmd->rq->bio && cmd->rq->bio->bi_css) {
cmd->css = cmd->rq->bio->bi_css;
css_get(cmd->css);
} else
#endif
cmd->css = NULL;
kthread_queue_work(&lo->worker, &cmd->work);
return BLK_STS_OK;

@ -72,6 +72,7 @@ struct loop_cmd {
long ret;
struct kiocb iocb;
struct bio_vec *bvec;
struct cgroup_subsys_state *css;
};
/* Support for loadable transfer modules */

@ -887,12 +887,9 @@ static void mtip_issue_non_ncq_command(struct mtip_port *port, int tag)
static bool mtip_pause_ncq(struct mtip_port *port,
struct host_to_dev_fis *fis)
{
struct host_to_dev_fis *reply;
unsigned long task_file_data;
reply = port->rxfis + RX_FIS_D2H_REG;
task_file_data = readl(port->mmio+PORT_TFDATA);
if ((task_file_data & 1))
return false;
@ -1020,7 +1017,6 @@ static int mtip_exec_internal_command(struct mtip_port *port,
.opts = opts
};
int rv = 0;
unsigned long start;
/* Make sure the buffer is 8 byte aligned. This is asic specific. */
if (buffer & 0x00000007) {
@ -1057,7 +1053,6 @@ static int mtip_exec_internal_command(struct mtip_port *port,
/* Copy the command to the command table */
memcpy(int_cmd->command, fis, fis_len*4);
start = jiffies;
rq->timeout = timeout;
/* insert request and run queue */
@ -3015,7 +3010,6 @@ static int mtip_hw_init(struct driver_data *dd)
{
int i;
int rv;
unsigned int num_command_slots;
unsigned long timeout, timetaken;
dd->mmio = pcim_iomap_table(dd->pdev)[MTIP_ABAR];
@ -3025,7 +3019,6 @@ static int mtip_hw_init(struct driver_data *dd)
rv = -EIO;
goto out1;
}
num_command_slots = dd->slot_groups * 32;
hba_setup(dd);

@ -288,15 +288,6 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
cmd->status = BLK_STS_TIMEOUT;
return BLK_EH_HANDLED;
}
/* If we are waiting on our dead timer then we could get timeout
* callbacks for our request. For this we just want to reset the timer
* and let the queue side take care of everything.
*/
if (!completion_done(&cmd->send_complete)) {
nbd_config_put(nbd);
return BLK_EH_RESET_TIMER;
}
config = nbd->config;
if (config->num_connections > 1) {
@ -723,9 +714,9 @@ static int wait_for_reconnect(struct nbd_device *nbd)
return 0;
if (test_bit(NBD_DISCONNECTED, &config->runtime_flags))
return 0;
wait_event_interruptible_timeout(config->conn_wait,
atomic_read(&config->live_connections),
config->dead_conn_timeout);
wait_event_timeout(config->conn_wait,
atomic_read(&config->live_connections),
config->dead_conn_timeout);
return atomic_read(&config->live_connections);
}
@ -740,6 +731,7 @@ static int nbd_handle_cmd(struct nbd_cmd *cmd, int index)
if (!refcount_inc_not_zero(&nbd->config_refs)) {
dev_err_ratelimited(disk_to_dev(nbd->disk),
"Socks array is empty\n");
blk_mq_start_request(req);
return -EINVAL;
}
config = nbd->config;
@ -748,6 +740,7 @@ static int nbd_handle_cmd(struct nbd_cmd *cmd, int index)
dev_err_ratelimited(disk_to_dev(nbd->disk),
"Attempted send on invalid socket\n");
nbd_config_put(nbd);
blk_mq_start_request(req);
return -EINVAL;
}
cmd->status = BLK_STS_OK;
@ -771,6 +764,7 @@ again:
*/
sock_shutdown(nbd);
nbd_config_put(nbd);
blk_mq_start_request(req);
return -EIO;
}
goto again;
@ -781,6 +775,7 @@ again:
* here so that it gets put _after_ the request that is already on the
* dispatch list.
*/
blk_mq_start_request(req);
if (unlikely(nsock->pending && nsock->pending != req)) {
blk_mq_requeue_request(req, true);
ret = 0;
@ -793,10 +788,10 @@ again:
ret = nbd_send_cmd(nbd, cmd, index);
if (ret == -EAGAIN) {
dev_err_ratelimited(disk_to_dev(nbd->disk),
"Request send failed trying another connection\n");
"Request send failed, requeueing\n");
nbd_mark_nsock_dead(nbd, nsock, 1);
mutex_unlock(&nsock->tx_lock);
goto again;
blk_mq_requeue_request(req, true);
ret = 0;
}
out:
mutex_unlock(&nsock->tx_lock);
@ -820,7 +815,6 @@ static blk_status_t nbd_queue_rq(struct blk_mq_hw_ctx *hctx,
* done sending everything over the wire.
*/
init_completion(&cmd->send_complete);
blk_mq_start_request(bd->rq);
/* We can be called directly from the user space process, which means we
* could possibly have signals pending so our sendmsg will fail. In

@ -154,6 +154,10 @@ enum {
NULL_Q_MQ = 2,
};
static int g_no_sched;
module_param_named(no_sched, g_no_sched, int, S_IRUGO);
MODULE_PARM_DESC(no_sched, "No io scheduler");
static int g_submit_queues = 1;
module_param_named(submit_queues, g_submit_queues, int, S_IRUGO);
MODULE_PARM_DESC(submit_queues, "Number of submission queues");
@ -1754,6 +1758,8 @@ static int null_init_tag_set(struct nullb *nullb, struct blk_mq_tag_set *set)
set->numa_node = nullb ? nullb->dev->home_node : g_home_node;
set->cmd_size = sizeof(struct nullb_cmd);
set->flags = BLK_MQ_F_SHOULD_MERGE;
if (g_no_sched)
set->flags |= BLK_MQ_F_NO_SCHED;
set->driver_data = NULL;
if ((nullb && nullb->dev->blocking) || g_blocking)
@ -1985,8 +1991,10 @@ static int __init null_init(void)
for (i = 0; i < nr_devices; i++) {
dev = null_alloc_dev();
if (!dev)
if (!dev) {
ret = -ENOMEM;
goto err_dev;
}
ret = null_add_dev(dev);
if (ret) {
null_free_dev(dev);

@ -26,6 +26,7 @@ config PARIDE_PD
config PARIDE_PCD
tristate "Parallel port ATAPI CD-ROMs"
depends on PARIDE
select CDROM
select BLK_SCSI_REQUEST # only for the generic cdrom code
---help---
This option enables the high-level driver for ATAPI CD-ROM devices

@ -1967,7 +1967,8 @@ static void skd_isr_msg_from_dev(struct skd_device *skdev)
break;
case FIT_MTD_CMD_LOG_HOST_ID:
skdev->connect_time_stamp = get_seconds();
/* hardware interface overflows in y2106 */
skdev->connect_time_stamp = (u32)ktime_get_real_seconds();
data = skdev->connect_time_stamp & 0xFFFF;
mtd = FIT_MXD_CONS(FIT_MTD_CMD_LOG_TIME_STAMP_LO, 0, data);
SKD_WRITEL(skdev, mtd, FIT_MSG_TO_DEVICE);
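
The "overflows in y2106" comment in the hunk above follows from truncating the seconds counter to an unsigned 32-bit value. The sketch below works through that arithmetic and the 16-bit split used for the TIME_STAMP_LO message. The helper names are hypothetical, and the high half (seconds >> 16) is an assumption, since only the low half appears in the hunk.

```c
#include <assert.h>
#include <stdint.h>

/* Mean Gregorian year: 365.2425 days * 86400 seconds. */
#define SECONDS_PER_YEAR 31556952ULL

/* Low 16 bits, as in "connect_time_stamp & 0xFFFF" in the hunk. */
uint16_t stamp_lo(uint32_t seconds)
{
	return (uint16_t)(seconds & 0xFFFF);
}

/* High 16 bits; assumed counterpart, not shown in the hunk. */
uint16_t stamp_hi(uint32_t seconds)
{
	return (uint16_t)(seconds >> 16);
}

/* Calendar year in which a u32 count of seconds since 1970 wraps. */
unsigned int u32_wrap_year(void)
{
	return 1970u + (unsigned int)(UINT32_MAX / SECONDS_PER_YEAR);
}
```

UINT32_MAX seconds is a little over 136 years, so a counter that starts in 1970 wraps during 2106.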

@ -1,14 +1,3 @@
# SPDX-License-Identifier: GPL-2.0
# Makefile for the kernel cdrom device drivers.
#
# 30 Jan 1998, Michael Elizabeth Chastain, <mailto:mec@shout.net>
# Rewritten to use lists instead of if-statements.
# Each configuration option enables a list of files.
obj-$(CONFIG_BLK_DEV_IDECD) += cdrom.o
obj-$(CONFIG_BLK_DEV_SR) += cdrom.o
obj-$(CONFIG_PARIDE_PCD) += cdrom.o
obj-$(CONFIG_CDROM_PKTCDVD) += cdrom.o
obj-$(CONFIG_GDROM) += gdrom.o cdrom.o
obj-$(CONFIG_CDROM) += cdrom.o
obj-$(CONFIG_GDROM) += gdrom.o

@ -117,7 +117,9 @@ config BLK_DEV_DELKIN
config BLK_DEV_IDECD
tristate "Include IDE/ATAPI CDROM support"
depends on BLK_DEV
select IDE_ATAPI
select CDROM
---help---
If you have a CD-ROM drive using the ATAPI protocol, say Y. ATAPI is
a newer protocol used by IDE CD-ROM and TAPE drives, similar to the

@ -282,7 +282,7 @@ int ide_cd_expiry(ide_drive_t *drive)
struct request *rq = drive->hwif->rq;
unsigned long wait = 0;
debug_log("%s: rq->cmd[0]: 0x%x\n", __func__, rq->cmd[0]);
debug_log("%s: scsi_req(rq)->cmd[0]: 0x%x\n", __func__, scsi_req(rq)->cmd[0]);
/*
* Some commands are *slow* and normally take a long time to complete.
@ -463,7 +463,7 @@ static ide_startstop_t ide_pc_intr(ide_drive_t *drive)
return ide_do_reset(drive);
}
debug_log("[cmd %x]: check condition\n", rq->cmd[0]);
debug_log("[cmd %x]: check condition\n", scsi_req(rq)->cmd[0]);
/* Retry operation */
ide_retry_pc(drive);
@ -531,7 +531,7 @@ static ide_startstop_t ide_pc_intr(ide_drive_t *drive)
ide_pad_transfer(drive, write, bcount);
debug_log("[cmd %x] transferred %d bytes, padded %d bytes, resid: %u\n",
rq->cmd[0], done, bcount, scsi_req(rq)->resid_len);
scsi_req(rq)->cmd[0], done, bcount, scsi_req(rq)->resid_len);
/* And set the interrupt handler again */
ide_set_handler(drive, ide_pc_intr, timeout);

@ -90,9 +90,9 @@ int generic_ide_resume(struct device *dev)
}
memset(&rqpm, 0, sizeof(rqpm));
rq = blk_get_request(drive->queue, REQ_OP_DRV_IN, __GFP_RECLAIM);
rq = blk_get_request_flags(drive->queue, REQ_OP_DRV_IN,
BLK_MQ_REQ_PREEMPT);
ide_req(rq)->type = ATA_PRIV_PM_RESUME;
rq->rq_flags |= RQF_PREEMPT;
rq->special = &rqpm;
rqpm.pm_step = IDE_PM_START_RESUME;
rqpm.pm_state = PM_EVENT_ON;

@ -4,7 +4,8 @@
menuconfig NVM
bool "Open-Channel SSD target support"
depends on BLOCK && HAS_DMA
depends on BLOCK && HAS_DMA && PCI
select BLK_DEV_NVME
help
Say Y here to get to enable Open-channel SSDs.

@ -22,6 +22,7 @@
#include <linux/types.h>
#include <linux/sem.h>
#include <linux/bitmap.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/miscdevice.h>
#include <linux/lightnvm.h>
@ -138,7 +139,6 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct nvm_dev *dev,
int prev_nr_luns;
int i, j;
nr_chnls = nr_luns / dev->geo.luns_per_chnl;
nr_chnls = (nr_chnls_mod == 0) ? nr_chnls : nr_chnls + 1;
dev_map = kmalloc(sizeof(struct nvm_dev_map), GFP_KERNEL);
@ -226,6 +226,24 @@ static const struct block_device_operations nvm_fops = {
.owner = THIS_MODULE,
};
static struct nvm_tgt_type *nvm_find_target_type(const char *name, int lock)
{
struct nvm_tgt_type *tmp, *tt = NULL;
if (lock)
down_write(&nvm_tgtt_lock);
list_for_each_entry(tmp, &nvm_tgt_types, list)
if (!strcmp(name, tmp->name)) {
tt = tmp;
break;
}
if (lock)
up_write(&nvm_tgtt_lock);
return tt;
}
static int nvm_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
{
struct nvm_ioctl_create_simple *s = &create->conf.s;
@ -316,6 +334,8 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
list_add_tail(&t->list, &dev->targets);
mutex_unlock(&dev->mlock);
__module_get(tt->owner);
return 0;
err_sysfs:
if (tt->exit)
@ -351,6 +371,7 @@ static void __nvm_remove_target(struct nvm_target *t)
nvm_remove_tgt_dev(t->dev, 1);
put_disk(tdisk);
module_put(t->type->owner);
list_del(&t->list);
kfree(t);
@ -532,25 +553,6 @@ void nvm_part_to_tgt(struct nvm_dev *dev, sector_t *entries,
}
EXPORT_SYMBOL(nvm_part_to_tgt);
struct nvm_tgt_type *nvm_find_target_type(const char *name, int lock)
{
struct nvm_tgt_type *tmp, *tt = NULL;
if (lock)
down_write(&nvm_tgtt_lock);
list_for_each_entry(tmp, &nvm_tgt_types, list)
if (!strcmp(name, tmp->name)) {
tt = tmp;
break;
}
if (lock)
up_write(&nvm_tgtt_lock);
return tt;
}
EXPORT_SYMBOL(nvm_find_target_type);
int nvm_register_tgt_type(struct nvm_tgt_type *tt)
{
int ret = 0;
@ -571,9 +573,9 @@ void nvm_unregister_tgt_type(struct nvm_tgt_type *tt)
if (!tt)
return;
down_write(&nvm_lock);
down_write(&nvm_tgtt_lock);
list_del(&tt->list);
up_write(&nvm_lock);
up_write(&nvm_tgtt_lock);
}
EXPORT_SYMBOL(nvm_unregister_tgt_type);
@ -602,6 +604,52 @@ static struct nvm_dev *nvm_find_nvm_dev(const char *name)
return NULL;
}
static int nvm_set_rqd_ppalist(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd,
const struct ppa_addr *ppas, int nr_ppas)
{
struct nvm_dev *dev = tgt_dev->parent;
struct nvm_geo *geo = &tgt_dev->geo;
int i, plane_cnt, pl_idx;
struct ppa_addr ppa;
if (geo->plane_mode == NVM_PLANE_SINGLE && nr_ppas == 1) {
rqd->nr_ppas = nr_ppas;
rqd->ppa_addr = ppas[0];
return 0;
}
rqd->nr_ppas = nr_ppas;
rqd->ppa_list = nvm_dev_dma_alloc(dev, GFP_KERNEL, &rqd->dma_ppa_list);
if (!rqd->ppa_list) {
pr_err("nvm: failed to allocate dma memory\n");
return -ENOMEM;
}
plane_cnt = geo->plane_mode;
rqd->nr_ppas *= plane_cnt;
for (i = 0; i < nr_ppas; i++) {
for (pl_idx = 0; pl_idx < plane_cnt; pl_idx++) {
ppa = ppas[i];
ppa.g.pl = pl_idx;
rqd->ppa_list[(pl_idx * nr_ppas) + i] = ppa;
}
}
return 0;
}
static void nvm_free_rqd_ppalist(struct nvm_tgt_dev *tgt_dev,
struct nvm_rq *rqd)
{
if (!rqd->ppa_list)
return;
nvm_dev_dma_free(tgt_dev->parent, rqd->ppa_list, rqd->dma_ppa_list);
}
int nvm_set_tgt_bb_tbl(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *ppas,
int nr_ppas, int type)
{
@ -616,7 +664,7 @@ int nvm_set_tgt_bb_tbl(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *ppas,
memset(&rqd, 0, sizeof(struct nvm_rq));
nvm_set_rqd_ppalist(tgt_dev, &rqd, ppas, nr_ppas, 1);
nvm_set_rqd_ppalist(tgt_dev, &rqd, ppas, nr_ppas);
nvm_rq_tgt_to_dev(tgt_dev, &rqd);
ret = dev->ops->set_bb_tbl(dev, &rqd.ppa_addr, rqd.nr_ppas, type);
@ -658,12 +706,25 @@ int nvm_submit_io(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd)
}
EXPORT_SYMBOL(nvm_submit_io);
static void nvm_end_io_sync(struct nvm_rq *rqd)
int nvm_submit_io_sync(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd)
{
struct completion *waiting = rqd->private;
struct nvm_dev *dev = tgt_dev->parent;
int ret;
complete(waiting);
if (!dev->ops->submit_io_sync)
return -ENODEV;
nvm_rq_tgt_to_dev(tgt_dev, rqd);
rqd->dev = tgt_dev;
/* In case of error, fail with right address format */
ret = dev->ops->submit_io_sync(dev, rqd);
nvm_rq_dev_to_tgt(tgt_dev, rqd);
return ret;
}
EXPORT_SYMBOL(nvm_submit_io_sync);
int nvm_erase_sync(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *ppas,
int nr_ppas)
@ -671,25 +732,21 @@ int nvm_erase_sync(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *ppas,
struct nvm_geo *geo = &tgt_dev->geo;
struct nvm_rq rqd;
int ret;
DECLARE_COMPLETION_ONSTACK(wait);
memset(&rqd, 0, sizeof(struct nvm_rq));
rqd.opcode = NVM_OP_ERASE;
rqd.end_io = nvm_end_io_sync;
rqd.private = &wait;
rqd.flags = geo->plane_mode >> 1;
ret = nvm_set_rqd_ppalist(tgt_dev, &rqd, ppas, nr_ppas, 1);
ret = nvm_set_rqd_ppalist(tgt_dev, &rqd, ppas, nr_ppas);
if (ret)
return ret;
ret = nvm_submit_io(tgt_dev, &rqd);
ret = nvm_submit_io_sync(tgt_dev, &rqd);
if (ret) {
pr_err("rrpr: erase I/O submission failed: %d\n", ret);
goto free_ppa_list;
}
wait_for_completion_io(&wait);
free_ppa_list:
nvm_free_rqd_ppalist(tgt_dev, &rqd);
@ -775,57 +832,6 @@ void nvm_put_area(struct nvm_tgt_dev *tgt_dev, sector_t begin)
}
EXPORT_SYMBOL(nvm_put_area);
int nvm_set_rqd_ppalist(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd,
const struct ppa_addr *ppas, int nr_ppas, int vblk)
{
struct nvm_dev *dev = tgt_dev->parent;
struct nvm_geo *geo = &tgt_dev->geo;
int i, plane_cnt, pl_idx;
struct ppa_addr ppa;
if ((!vblk || geo->plane_mode == NVM_PLANE_SINGLE) && nr_ppas == 1) {
rqd->nr_ppas = nr_ppas;
rqd->ppa_addr = ppas[0];
return 0;
}
rqd->nr_ppas = nr_ppas;
rqd->ppa_list = nvm_dev_dma_alloc(dev, GFP_KERNEL, &rqd->dma_ppa_list);
if (!rqd->ppa_list) {
pr_err("nvm: failed to allocate dma memory\n");
return -ENOMEM;
}
if (!vblk) {
for (i = 0; i < nr_ppas; i++)
rqd->ppa_list[i] = ppas[i];
} else {
plane_cnt = geo->plane_mode;
rqd->nr_ppas *= plane_cnt;
for (i = 0; i < nr_ppas; i++) {
for (pl_idx = 0; pl_idx < plane_cnt; pl_idx++) {
ppa = ppas[i];
ppa.g.pl = pl_idx;
rqd->ppa_list[(pl_idx * nr_ppas) + i] = ppa;
}
}
}
return 0;
}
EXPORT_SYMBOL(nvm_set_rqd_ppalist);
void nvm_free_rqd_ppalist(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd)
{
if (!rqd->ppa_list)
return;
nvm_dev_dma_free(tgt_dev->parent, rqd->ppa_list, rqd->dma_ppa_list);
}
EXPORT_SYMBOL(nvm_free_rqd_ppalist);
void nvm_end_io(struct nvm_rq *rqd)
{
struct nvm_tgt_dev *tgt_dev = rqd->dev;
@ -1177,7 +1183,7 @@ static long nvm_ioctl_info(struct file *file, void __user *arg)
info->version[1] = NVM_VERSION_MINOR;
info->version[2] = NVM_VERSION_PATCH;
down_write(&nvm_lock);
down_write(&nvm_tgtt_lock);
list_for_each_entry(tt, &nvm_tgt_types, list) {
struct nvm_ioctl_info_tgt *tgt = &info->tgts[tgt_iter];
@ -1190,7 +1196,7 @@ static long nvm_ioctl_info(struct file *file, void __user *arg)
}
info->tgtsize = tgt_iter;
up_write(&nvm_lock);
up_write(&nvm_tgtt_lock);
if (copy_to_user(arg, info, sizeof(struct nvm_ioctl_info))) {
kfree(info);
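
With the vblk parameter dropped, nvm_set_rqd_ppalist() above always replicates each address across planes and stores the result at index (pl_idx * nr_ppas) + i. The following userspace sketch shows only that indexing, assuming a simplified address type. struct ppa and ppa_expand_planes() are hypothetical names, and the caller is assumed to provide an output array of nr_ppas * plane_cnt entries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified physical address: block plus plane. */
struct ppa {
	int blk;
	int pl;
};

/*
 * Replicate every input address once per plane. Entries are grouped
 * by plane: all addresses for plane 0 first, then plane 1, and so on,
 * which is what the (pl * nr_ppas) + i index produces.
 * Returns the number of entries written to out.
 */
size_t ppa_expand_planes(const struct ppa *ppas, size_t nr_ppas,
			 int plane_cnt, struct ppa *out)
{
	size_t i;
	int pl;

	assert(plane_cnt > 0);

	for (i = 0; i < nr_ppas; i++) {
		for (pl = 0; pl < plane_cnt; pl++) {
			struct ppa p = ppas[i];

			p.pl = pl;
			out[(size_t)pl * nr_ppas + i] = p;
		}
	}
	return nr_ppas * (size_t)plane_cnt;
}
```

For two input blocks and two planes, the output order is (blk0, pl0), (blk1, pl0), (blk0, pl1), (blk1, pl1).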

@ -43,8 +43,10 @@ retry:
if (unlikely(!bio_has_data(bio)))
goto out;
w_ctx.flags = flags;
pblk_ppa_set_empty(&w_ctx.ppa);
w_ctx.flags = flags;
if (bio->bi_opf & REQ_PREFLUSH)
w_ctx.flags |= PBLK_FLUSH_ENTRY;
for (i = 0; i < nr_entries; i++) {
void *data = bio_data(bio);
@ -73,12 +75,11 @@ out:
* On GC the incoming lbas are not necessarily sequential. Also, some of the
* lbas might not be valid entries, which are marked as empty by the GC thread
*/
int pblk_write_gc_to_cache(struct pblk *pblk, void *data, u64 *lba_list,
unsigned int nr_entries, unsigned int nr_rec_entries,
struct pblk_line *gc_line, unsigned long flags)
int pblk_write_gc_to_cache(struct pblk *pblk, struct pblk_gc_rq *gc_rq)
{
struct pblk_w_ctx w_ctx;
unsigned int bpos, pos;
void *data = gc_rq->data;
int i, valid_entries;
/* Update the write buffer head (mem) with the entries that we can
@ -86,28 +87,29 @@ int pblk_write_gc_to_cache(struct pblk *pblk, void *data, u64 *lba_list,
* rollback from here on.
*/
retry:
if (!pblk_rb_may_write_gc(&pblk->rwb, nr_rec_entries, &bpos)) {
if (!pblk_rb_may_write_gc(&pblk->rwb, gc_rq->secs_to_gc, &bpos)) {
io_schedule();
goto retry;
}
w_ctx.flags = flags;
w_ctx.flags = PBLK_IOTYPE_GC;
pblk_ppa_set_empty(&w_ctx.ppa);
for (i = 0, valid_entries = 0; i < nr_entries; i++) {
if (lba_list[i] == ADDR_EMPTY)
for (i = 0, valid_entries = 0; i < gc_rq->nr_secs; i++) {
if (gc_rq->lba_list[i] == ADDR_EMPTY)
continue;
w_ctx.lba = lba_list[i];
w_ctx.lba = gc_rq->lba_list[i];
pos = pblk_rb_wrap_pos(&pblk->rwb, bpos + valid_entries);
pblk_rb_write_entry_gc(&pblk->rwb, data, w_ctx, gc_line, pos);
pblk_rb_write_entry_gc(&pblk->rwb, data, w_ctx, gc_rq->line,
gc_rq->paddr_list[i], pos);
data += PBLK_EXPOSED_PAGE_SIZE;
valid_entries++;
}
WARN_ONCE(nr_rec_entries != valid_entries,
WARN_ONCE(gc_rq->secs_to_gc != valid_entries,
"pblk: inconsistent GC write\n");
#ifdef CONFIG_NVM_DEBUG
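
The loop in pblk_write_gc_to_cache() above skips entries marked ADDR_EMPTY and places each valid entry at a wrapped position starting from bpos, then checks that the number of valid entries equals secs_to_gc. The sketch below is a simplified model of that loop. It assumes a power-of-two ring size so that wrapping can be a mask, and an all-ones value for ADDR_EMPTY; both are assumptions, and gc_fill_ring() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADDR_EMPTY (~0ULL)	/* assumed marker for an invalid lba */

/*
 * Copy the valid lbas into a ring buffer starting at bpos, wrapping
 * at ring_size (assumed to be a power of two). Returns the number of
 * valid entries written, which the caller can compare against the
 * count it expected to move.
 */
size_t gc_fill_ring(uint64_t *ring, size_t ring_size, size_t bpos,
		    const uint64_t *lba_list, size_t nr_secs)
{
	size_t i, valid = 0;

	for (i = 0; i < nr_secs; i++) {
		if (lba_list[i] == ADDR_EMPTY)
			continue;	/* hole left by the GC thread */

		ring[(bpos + valid) & (ring_size - 1)] = lba_list[i];
		valid++;
	}
	return valid;
}
```

Only valid entries advance the write position, so holes in lba_list do not consume ring slots.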

@ -18,6 +18,31 @@
#include "pblk.h"
static void pblk_line_mark_bb(struct work_struct *work)
{
struct pblk_line_ws *line_ws = container_of(work, struct pblk_line_ws,
ws);
struct pblk *pblk = line_ws->pblk;
struct nvm_tgt_dev *dev = pblk->dev;
struct ppa_addr *ppa = line_ws->priv;
int ret;
ret = nvm_set_tgt_bb_tbl(dev, ppa, 1, NVM_BLK_T_GRWN_BAD);
if (ret) {
struct pblk_line *line;
int pos;
line = &pblk->lines[pblk_dev_ppa_to_line(*ppa)];
pos = pblk_dev_ppa_to_pos(&dev->geo, *ppa);
pr_err("pblk: failed to mark bb, line:%d, pos:%d\n",
line->id, pos);
}
kfree(ppa);
mempool_free(line_ws, pblk->gen_ws_pool);
}
static void pblk_mark_bb(struct pblk *pblk, struct pblk_line *line,
struct ppa_addr *ppa)
{
@ -33,7 +58,8 @@ static void pblk_mark_bb(struct pblk *pblk, struct pblk_line *line,
pr_err("pblk: attempted to erase bb: line:%d, pos:%d\n",
line->id, pos);
pblk_line_run_ws(pblk, NULL, ppa, pblk_line_mark_bb, pblk->bb_wq);
pblk_gen_run_ws(pblk, NULL, ppa, pblk_line_mark_bb,
GFP_ATOMIC, pblk->bb_wq);
}
static void __pblk_end_io_erase(struct pblk *pblk, struct nvm_rq *rqd)
@ -63,7 +89,7 @@ static void pblk_end_io_erase(struct nvm_rq *rqd)
struct pblk *pblk = rqd->private;
__pblk_end_io_erase(pblk, rqd);
mempool_free(rqd, pblk->g_rq_pool);
mempool_free(rqd, pblk->e_rq_pool);
}
void __pblk_map_invalidate(struct pblk *pblk, struct pblk_line *line,
@ -77,11 +103,7 @@ void __pblk_map_invalidate(struct pblk *pblk, struct pblk_line *line,
* that newer updates are not overwritten.
*/
spin_lock(&line->lock);
if (line->state == PBLK_LINESTATE_GC ||
line->state == PBLK_LINESTATE_FREE) {
spin_unlock(&line->lock);
return;
}
WARN_ON(line->state == PBLK_LINESTATE_FREE);
if (test_and_set_bit(paddr, line->invalid_bitmap)) {
WARN_ONCE(1, "pblk: double invalidate\n");
@ -98,8 +120,7 @@ void __pblk_map_invalidate(struct pblk *pblk, struct pblk_line *line,
spin_lock(&l_mg->gc_lock);
spin_lock(&line->lock);
/* Prevent moving a line that has just been chosen for GC */
if (line->state == PBLK_LINESTATE_GC ||
line->state == PBLK_LINESTATE_FREE) {
if (line->state == PBLK_LINESTATE_GC) {
spin_unlock(&line->lock);
spin_unlock(&l_mg->gc_lock);
return;
@ -150,17 +171,25 @@ static void pblk_invalidate_range(struct pblk *pblk, sector_t slba,
spin_unlock(&pblk->trans_lock);
}
struct nvm_rq *pblk_alloc_rqd(struct pblk *pblk, int rw)
/* Caller must guarantee that the request is a valid type */
struct nvm_rq *pblk_alloc_rqd(struct pblk *pblk, int type)
{
mempool_t *pool;
struct nvm_rq *rqd;
int rq_size;
if (rw == WRITE) {
switch (type) {
case PBLK_WRITE:
case PBLK_WRITE_INT:
pool = pblk->w_rq_pool;
rq_size = pblk_w_rq_size;
} else {
pool = pblk->g_rq_pool;
break;
case PBLK_READ:
pool = pblk->r_rq_pool;
rq_size = pblk_g_rq_size;
break;
default:
pool = pblk->e_rq_pool;
rq_size = pblk_g_rq_size;
}
@ -170,15 +199,30 @@ struct nvm_rq *pblk_alloc_rqd(struct pblk *pblk, int rw)
return rqd;
}
void pblk_free_rqd(struct pblk *pblk, struct nvm_rq *rqd, int rw)
/* Typically used on completion path. Cannot guarantee request consistency */
void pblk_free_rqd(struct pblk *pblk, struct nvm_rq *rqd, int type)
{
struct nvm_tgt_dev *dev = pblk->dev;
mempool_t *pool;
if (rw == WRITE)
switch (type) {
case PBLK_WRITE:
kfree(((struct pblk_c_ctx *)nvm_rq_to_pdu(rqd))->lun_bitmap);
case PBLK_WRITE_INT:
pool = pblk->w_rq_pool;
else
pool = pblk->g_rq_pool;
break;
case PBLK_READ:
pool = pblk->r_rq_pool;
break;
case PBLK_ERASE:
pool = pblk->e_rq_pool;
break;
default:
pr_err("pblk: trying to free unknown rqd type\n");
return;
}
nvm_dev_dma_free(dev->parent, rqd->meta_list, rqd->dma_meta_list);
mempool_free(rqd, pool);
}
@ -190,10 +234,9 @@ void pblk_bio_free_pages(struct pblk *pblk, struct bio *bio, int off,
WARN_ON(off + nr_pages != bio->bi_vcnt);
bio_advance(bio, off * PBLK_EXPOSED_PAGE_SIZE);
for (i = off; i < nr_pages + off; i++) {
bv = bio->bi_io_vec[i];
mempool_free(bv.bv_page, pblk->page_pool);
mempool_free(bv.bv_page, pblk->page_bio_pool);
}
}
@ -205,14 +248,12 @@ int pblk_bio_add_pages(struct pblk *pblk, struct bio *bio, gfp_t flags,
int i, ret;
for (i = 0; i < nr_pages; i++) {
page = mempool_alloc(pblk->page_pool, flags);
if (!page)
goto err;
page = mempool_alloc(pblk->page_bio_pool, flags);
ret = bio_add_pc_page(q, bio, page, PBLK_EXPOSED_PAGE_SIZE, 0);
if (ret != PBLK_EXPOSED_PAGE_SIZE) {
pr_err("pblk: could not add page to bio\n");
mempool_free(page, pblk->page_pool);
mempool_free(page, pblk->page_bio_pool);
goto err;
}
}
@ -245,13 +286,6 @@ void pblk_write_should_kick(struct pblk *pblk)
pblk_write_kick(pblk);
}
void pblk_end_bio_sync(struct bio *bio)
{
struct completion *waiting = bio->bi_private;
complete(waiting);
}
void pblk_end_io_sync(struct nvm_rq *rqd)
{
struct completion *waiting = rqd->private;
@ -259,7 +293,7 @@ void pblk_end_io_sync(struct nvm_rq *rqd)
complete(waiting);
}
void pblk_wait_for_meta(struct pblk *pblk)
static void pblk_wait_for_meta(struct pblk *pblk)
{
do {
if (!atomic_read(&pblk->inflight_io))
@ -336,17 +370,6 @@ void pblk_discard(struct pblk *pblk, struct bio *bio)
pblk_invalidate_range(pblk, slba, nr_secs);
}
struct ppa_addr pblk_get_lba_map(struct pblk *pblk, sector_t lba)
{
struct ppa_addr ppa;
spin_lock(&pblk->trans_lock);
ppa = pblk_trans_map_get(pblk, lba);
spin_unlock(&pblk->trans_lock);
return ppa;
}
void pblk_log_write_err(struct pblk *pblk, struct nvm_rq *rqd)
{
atomic_long_inc(&pblk->write_failed);
@ -389,34 +412,11 @@ int pblk_submit_io(struct pblk *pblk, struct nvm_rq *rqd)
struct nvm_tgt_dev *dev = pblk->dev;
#ifdef CONFIG_NVM_DEBUG
struct ppa_addr *ppa_list;
int ret;
ppa_list = (rqd->nr_ppas > 1) ? rqd->ppa_list : &rqd->ppa_addr;
if (pblk_boundary_ppa_checks(dev, ppa_list, rqd->nr_ppas)) {
WARN_ON(1);
return -EINVAL;
}
if (rqd->opcode == NVM_OP_PWRITE) {
struct pblk_line *line;
struct ppa_addr ppa;
int i;
for (i = 0; i < rqd->nr_ppas; i++) {
ppa = ppa_list[i];
line = &pblk->lines[pblk_dev_ppa_to_line(ppa)];
spin_lock(&line->lock);
if (line->state != PBLK_LINESTATE_OPEN) {
pr_err("pblk: bad ppa: line:%d,state:%d\n",
line->id, line->state);
WARN_ON(1);
spin_unlock(&line->lock);
return -EINVAL;
}
spin_unlock(&line->lock);
}
}
ret = pblk_check_io(pblk, rqd);
if (ret)
return ret;
#endif
atomic_inc(&pblk->inflight_io);
@ -424,6 +424,28 @@ int pblk_submit_io(struct pblk *pblk, struct nvm_rq *rqd)
return nvm_submit_io(dev, rqd);
}
int pblk_submit_io_sync(struct pblk *pblk, struct nvm_rq *rqd)
{
struct nvm_tgt_dev *dev = pblk->dev;
#ifdef CONFIG_NVM_DEBUG
int ret;
ret = pblk_check_io(pblk, rqd);
if (ret)
return ret;
#endif
atomic_inc(&pblk->inflight_io);
return nvm_submit_io_sync(dev, rqd);
}
static void pblk_bio_map_addr_endio(struct bio *bio)
{
bio_put(bio);
}
struct bio *pblk_bio_map_addr(struct pblk *pblk, void *data,
unsigned int nr_secs, unsigned int len,
int alloc_type, gfp_t gfp_mask)
@ -460,6 +482,8 @@ struct bio *pblk_bio_map_addr(struct pblk *pblk, void *data,
kaddr += PAGE_SIZE;
}
bio->bi_end_io = pblk_bio_map_addr_endio;
out:
return bio;
}
@ -486,12 +510,14 @@ void pblk_dealloc_page(struct pblk *pblk, struct pblk_line *line, int nr_secs)
u64 addr;
int i;
spin_lock(&line->lock);
addr = find_next_zero_bit(line->map_bitmap,
pblk->lm.sec_per_line, line->cur_sec);
line->cur_sec = addr - nr_secs;
for (i = 0; i < nr_secs; i++, line->cur_sec--)
WARN_ON(!test_and_clear_bit(line->cur_sec, line->map_bitmap));
spin_unlock(&line->lock);
}
u64 __pblk_alloc_page(struct pblk *pblk, struct pblk_line *line, int nr_secs)
@ -565,12 +591,11 @@ static int pblk_line_submit_emeta_io(struct pblk *pblk, struct pblk_line *line,
int cmd_op, bio_op;
int i, j;
int ret;
DECLARE_COMPLETION_ONSTACK(wait);
if (dir == WRITE) {
if (dir == PBLK_WRITE) {
bio_op = REQ_OP_WRITE;
cmd_op = NVM_OP_PWRITE;
} else if (dir == READ) {
} else if (dir == PBLK_READ) {
bio_op = REQ_OP_READ;
cmd_op = NVM_OP_PREAD;
} else
@ -607,13 +632,11 @@ next_rq:
rqd.dma_ppa_list = dma_ppa_list;
rqd.opcode = cmd_op;
rqd.nr_ppas = rq_ppas;
rqd.end_io = pblk_end_io_sync;
rqd.private = &wait;
if (dir == WRITE) {
if (dir == PBLK_WRITE) {
struct pblk_sec_meta *meta_list = rqd.meta_list;
rqd.flags = pblk_set_progr_mode(pblk, WRITE);
rqd.flags = pblk_set_progr_mode(pblk, PBLK_WRITE);
for (i = 0; i < rqd.nr_ppas; ) {
spin_lock(&line->lock);
paddr = __pblk_alloc_page(pblk, line, min);
@ -662,25 +685,17 @@ next_rq:
}
}
ret = pblk_submit_io(pblk, &rqd);
ret = pblk_submit_io_sync(pblk, &rqd);
if (ret) {
pr_err("pblk: emeta I/O submission failed: %d\n", ret);
bio_put(bio);
goto free_rqd_dma;
}
if (!wait_for_completion_io_timeout(&wait,
msecs_to_jiffies(PBLK_COMMAND_TIMEOUT_MS))) {
pr_err("pblk: emeta I/O timed out\n");
}
atomic_dec(&pblk->inflight_io);
reinit_completion(&wait);
if (likely(pblk->l_mg.emeta_alloc_type == PBLK_VMALLOC_META))
bio_put(bio);
if (rqd.error) {
if (dir == WRITE)
if (dir == PBLK_WRITE)
pblk_log_write_err(pblk, &rqd);
else
pblk_log_read_err(pblk, &rqd);
@ -721,14 +736,13 @@ static int pblk_line_submit_smeta_io(struct pblk *pblk, struct pblk_line *line,
int i, ret;
int cmd_op, bio_op;
int flags;
DECLARE_COMPLETION_ONSTACK(wait);
if (dir == WRITE) {
if (dir == PBLK_WRITE) {
bio_op = REQ_OP_WRITE;
cmd_op = NVM_OP_PWRITE;
flags = pblk_set_progr_mode(pblk, WRITE);
flags = pblk_set_progr_mode(pblk, PBLK_WRITE);
lba_list = emeta_to_lbas(pblk, line->emeta->buf);
} else if (dir == READ) {
} else if (dir == PBLK_READ) {
bio_op = REQ_OP_READ;
cmd_op = NVM_OP_PREAD;
flags = pblk_set_read_mode(pblk, PBLK_READ_SEQUENTIAL);
@ -758,15 +772,13 @@ static int pblk_line_submit_smeta_io(struct pblk *pblk, struct pblk_line *line,
rqd.opcode = cmd_op;
rqd.flags = flags;
rqd.nr_ppas = lm->smeta_sec;
rqd.end_io = pblk_end_io_sync;
rqd.private = &wait;
for (i = 0; i < lm->smeta_sec; i++, paddr++) {
struct pblk_sec_meta *meta_list = rqd.meta_list;
rqd.ppa_list[i] = addr_to_gen_ppa(pblk, paddr, line->id);
if (dir == WRITE) {
if (dir == PBLK_WRITE) {
__le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
meta_list[i].lba = lba_list[paddr] = addr_empty;
@ -778,21 +790,17 @@ static int pblk_line_submit_smeta_io(struct pblk *pblk, struct pblk_line *line,
* the write thread is the only one sending write and erase commands,
* there is no need to take the LUN semaphore.
*/
ret = pblk_submit_io(pblk, &rqd);
ret = pblk_submit_io_sync(pblk, &rqd);
if (ret) {
pr_err("pblk: smeta I/O submission failed: %d\n", ret);
bio_put(bio);
goto free_ppa_list;
}
if (!wait_for_completion_io_timeout(&wait,
msecs_to_jiffies(PBLK_COMMAND_TIMEOUT_MS))) {
pr_err("pblk: smeta I/O timed out\n");
}
atomic_dec(&pblk->inflight_io);
if (rqd.error) {
if (dir == WRITE)
if (dir == PBLK_WRITE)
pblk_log_write_err(pblk, &rqd);
else
pblk_log_read_err(pblk, &rqd);
@ -808,14 +816,14 @@ int pblk_line_read_smeta(struct pblk *pblk, struct pblk_line *line)
{
u64 bpaddr = pblk_line_smeta_start(pblk, line);
return pblk_line_submit_smeta_io(pblk, line, bpaddr, READ);
return pblk_line_submit_smeta_io(pblk, line, bpaddr, PBLK_READ);
}
int pblk_line_read_emeta(struct pblk *pblk, struct pblk_line *line,
void *emeta_buf)
{
return pblk_line_submit_emeta_io(pblk, line, emeta_buf,
line->emeta_ssec, READ);
line->emeta_ssec, PBLK_READ);
}
static void pblk_setup_e_rq(struct pblk *pblk, struct nvm_rq *rqd,
@ -824,7 +832,7 @@ static void pblk_setup_e_rq(struct pblk *pblk, struct nvm_rq *rqd,
rqd->opcode = NVM_OP_ERASE;
rqd->ppa_addr = ppa;
rqd->nr_ppas = 1;
rqd->flags = pblk_set_progr_mode(pblk, ERASE);
rqd->flags = pblk_set_progr_mode(pblk, PBLK_ERASE);
rqd->bio = NULL;
}
@ -832,19 +840,15 @@ static int pblk_blk_erase_sync(struct pblk *pblk, struct ppa_addr ppa)
{
struct nvm_rq rqd;
int ret = 0;
DECLARE_COMPLETION_ONSTACK(wait);
memset(&rqd, 0, sizeof(struct nvm_rq));
pblk_setup_e_rq(pblk, &rqd, ppa);
rqd.end_io = pblk_end_io_sync;
rqd.private = &wait;
/* The write thread schedules erases so that it minimizes disturbances
* with writes. Thus, there is no need to take the LUN semaphore.
*/
ret = pblk_submit_io(pblk, &rqd);
ret = pblk_submit_io_sync(pblk, &rqd);
if (ret) {
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = &dev->geo;
@ -857,11 +861,6 @@ static int pblk_blk_erase_sync(struct pblk *pblk, struct ppa_addr ppa)
goto out;
}
if (!wait_for_completion_io_timeout(&wait,
msecs_to_jiffies(PBLK_COMMAND_TIMEOUT_MS))) {
pr_err("pblk: sync erase timed out\n");
}
out:
rqd.private = pblk;
__pblk_end_io_erase(pblk, &rqd);
@ -976,7 +975,7 @@ static int pblk_line_init_metadata(struct pblk *pblk, struct pblk_line *line,
memcpy(smeta_buf->header.uuid, pblk->instance_uuid, 16);
smeta_buf->header.id = cpu_to_le32(line->id);
smeta_buf->header.type = cpu_to_le16(line->type);
smeta_buf->header.version = cpu_to_le16(1);
smeta_buf->header.version = SMETA_VERSION;
/* Start metadata */
smeta_buf->seq_nr = cpu_to_le64(line->seq_nr);
@ -1046,7 +1045,7 @@ static int pblk_line_init_bb(struct pblk *pblk, struct pblk_line *line,
line->smeta_ssec = off;
line->cur_sec = off + lm->smeta_sec;
if (init && pblk_line_submit_smeta_io(pblk, line, off, WRITE)) {
if (init && pblk_line_submit_smeta_io(pblk, line, off, PBLK_WRITE)) {
pr_debug("pblk: line smeta I/O failed. Retry\n");
return 1;
}
@ -1056,7 +1055,6 @@ static int pblk_line_init_bb(struct pblk *pblk, struct pblk_line *line,
/* Mark emeta metadata sectors as bad sectors. We need to consider bad
* blocks to make sure that there are enough sectors to store emeta
*/
bit = lm->sec_per_line;
off = lm->sec_per_line - lm->emeta_sec[0];
bitmap_set(line->invalid_bitmap, off, lm->emeta_sec[0]);
while (nr_bb) {
@ -1093,25 +1091,21 @@ static int pblk_line_prepare(struct pblk *pblk, struct pblk_line *line)
struct pblk_line_meta *lm = &pblk->lm;
int blk_in_line = atomic_read(&line->blk_in_line);
line->map_bitmap = mempool_alloc(pblk->line_meta_pool, GFP_ATOMIC);
line->map_bitmap = kzalloc(lm->sec_bitmap_len, GFP_ATOMIC);
if (!line->map_bitmap)
return -ENOMEM;
memset(line->map_bitmap, 0, lm->sec_bitmap_len);
/* invalid_bitmap is special since it is used when line is closed. No
* need to zeroized; it will be initialized using bb info form
* map_bitmap
*/
line->invalid_bitmap = mempool_alloc(pblk->line_meta_pool, GFP_ATOMIC);
/* will be initialized using bb info from map_bitmap */
line->invalid_bitmap = kmalloc(lm->sec_bitmap_len, GFP_ATOMIC);
if (!line->invalid_bitmap) {
mempool_free(line->map_bitmap, pblk->line_meta_pool);
kfree(line->map_bitmap);
return -ENOMEM;
}
spin_lock(&line->lock);
if (line->state != PBLK_LINESTATE_FREE) {
mempool_free(line->invalid_bitmap, pblk->line_meta_pool);
mempool_free(line->map_bitmap, pblk->line_meta_pool);
kfree(line->map_bitmap);
kfree(line->invalid_bitmap);
spin_unlock(&line->lock);
WARN(1, "pblk: corrupted line %d, state %d\n",
line->id, line->state);
@ -1163,7 +1157,7 @@ int pblk_line_recov_alloc(struct pblk *pblk, struct pblk_line *line)
void pblk_line_recov_close(struct pblk *pblk, struct pblk_line *line)
{
mempool_free(line->map_bitmap, pblk->line_meta_pool);
kfree(line->map_bitmap);
line->map_bitmap = NULL;
line->smeta = NULL;
line->emeta = NULL;
@ -1328,6 +1322,41 @@ static void pblk_stop_writes(struct pblk *pblk, struct pblk_line *line)
pblk->state = PBLK_STATE_STOPPING;
}
static void pblk_line_close_meta_sync(struct pblk *pblk)
{
struct pblk_line_mgmt *l_mg = &pblk->l_mg;
struct pblk_line_meta *lm = &pblk->lm;
struct pblk_line *line, *tline;
LIST_HEAD(list);
spin_lock(&l_mg->close_lock);
if (list_empty(&l_mg->emeta_list)) {
spin_unlock(&l_mg->close_lock);
return;
}
list_cut_position(&list, &l_mg->emeta_list, l_mg->emeta_list.prev);
spin_unlock(&l_mg->close_lock);
list_for_each_entry_safe(line, tline, &list, list) {
struct pblk_emeta *emeta = line->emeta;
while (emeta->mem < lm->emeta_len[0]) {
int ret;
ret = pblk_submit_meta_io(pblk, line);
if (ret) {
pr_err("pblk: sync meta line %d failed (%d)\n",
line->id, ret);
return;
}
}
}
pblk_wait_for_meta(pblk);
flush_workqueue(pblk->close_wq);
}
void pblk_pipeline_stop(struct pblk *pblk)
{
struct pblk_line_mgmt *l_mg = &pblk->l_mg;
@ -1361,17 +1390,17 @@ void pblk_pipeline_stop(struct pblk *pblk)
spin_unlock(&l_mg->free_lock);
}
void pblk_line_replace_data(struct pblk *pblk)
struct pblk_line *pblk_line_replace_data(struct pblk *pblk)
{
struct pblk_line_mgmt *l_mg = &pblk->l_mg;
struct pblk_line *cur, *new;
struct pblk_line *cur, *new = NULL;
unsigned int left_seblks;
int is_next = 0;
cur = l_mg->data_line;
new = l_mg->data_next;
if (!new)
return;
goto out;
l_mg->data_line = new;
spin_lock(&l_mg->free_lock);
@ -1379,7 +1408,7 @@ void pblk_line_replace_data(struct pblk *pblk)
l_mg->data_line = NULL;
l_mg->data_next = NULL;
spin_unlock(&l_mg->free_lock);
return;
goto out;
}
pblk_line_setup_metadata(new, l_mg, &pblk->lm);
@ -1391,7 +1420,7 @@ retry_erase:
/* If line is not fully erased, erase it */
if (atomic_read(&new->left_eblks)) {
if (pblk_line_erase(pblk, new))
return;
goto out;
} else {
io_schedule();
}
@ -1402,7 +1431,7 @@ retry_setup:
if (!pblk_line_init_metadata(pblk, new, cur)) {
new = pblk_line_retry(pblk, new);
if (!new)
return;
goto out;
goto retry_setup;
}
@ -1410,7 +1439,7 @@ retry_setup:
if (!pblk_line_init_bb(pblk, new, 1)) {
new = pblk_line_retry(pblk, new);
if (!new)
return;
goto out;
goto retry_setup;
}
@ -1434,14 +1463,15 @@ retry_setup:
if (is_next)
pblk_rl_free_lines_dec(&pblk->rl, l_mg->data_next);
out:
return new;
}
void pblk_line_free(struct pblk *pblk, struct pblk_line *line)
{
if (line->map_bitmap)
mempool_free(line->map_bitmap, pblk->line_meta_pool);
if (line->invalid_bitmap)
mempool_free(line->invalid_bitmap, pblk->line_meta_pool);
kfree(line->map_bitmap);
kfree(line->invalid_bitmap);
*line->vsc = cpu_to_le32(EMPTY_ENTRY);
@ -1451,11 +1481,10 @@ void pblk_line_free(struct pblk *pblk, struct pblk_line *line)
line->emeta = NULL;
}
void pblk_line_put(struct kref *ref)
static void __pblk_line_put(struct pblk *pblk, struct pblk_line *line)
{
struct pblk_line *line = container_of(ref, struct pblk_line, ref);
struct pblk *pblk = line->pblk;
struct pblk_line_mgmt *l_mg = &pblk->l_mg;
struct pblk_gc *gc = &pblk->gc;
spin_lock(&line->lock);
WARN_ON(line->state != PBLK_LINESTATE_GC);
@ -1464,6 +1493,8 @@ void pblk_line_put(struct kref *ref)
pblk_line_free(pblk, line);
spin_unlock(&line->lock);
atomic_dec(&gc->pipeline_gc);
spin_lock(&l_mg->free_lock);
list_add_tail(&line->list, &l_mg->free_list);
l_mg->nr_free_lines++;
@ -1472,13 +1503,49 @@ void pblk_line_put(struct kref *ref)
pblk_rl_free_lines_inc(&pblk->rl, line);
}
static void pblk_line_put_ws(struct work_struct *work)
{
struct pblk_line_ws *line_put_ws = container_of(work,
struct pblk_line_ws, ws);
struct pblk *pblk = line_put_ws->pblk;
struct pblk_line *line = line_put_ws->line;
__pblk_line_put(pblk, line);
mempool_free(line_put_ws, pblk->gen_ws_pool);
}
void pblk_line_put(struct kref *ref)
{
struct pblk_line *line = container_of(ref, struct pblk_line, ref);
struct pblk *pblk = line->pblk;
__pblk_line_put(pblk, line);
}
void pblk_line_put_wq(struct kref *ref)
{
struct pblk_line *line = container_of(ref, struct pblk_line, ref);
struct pblk *pblk = line->pblk;
struct pblk_line_ws *line_put_ws;
line_put_ws = mempool_alloc(pblk->gen_ws_pool, GFP_ATOMIC);
if (!line_put_ws)
return;
line_put_ws->pblk = pblk;
line_put_ws->line = line;
line_put_ws->priv = NULL;
INIT_WORK(&line_put_ws->ws, pblk_line_put_ws);
queue_work(pblk->r_end_wq, &line_put_ws->ws);
}
int pblk_blk_erase_async(struct pblk *pblk, struct ppa_addr ppa)
{
struct nvm_rq *rqd;
int err;
rqd = mempool_alloc(pblk->g_rq_pool, GFP_KERNEL);
memset(rqd, 0, pblk_g_rq_size);
rqd = pblk_alloc_rqd(pblk, PBLK_ERASE);
pblk_setup_e_rq(pblk, rqd, ppa);
@ -1517,41 +1584,6 @@ int pblk_line_is_full(struct pblk_line *line)
return (line->left_msecs == 0);
}
void pblk_line_close_meta_sync(struct pblk *pblk)
{
struct pblk_line_mgmt *l_mg = &pblk->l_mg;
struct pblk_line_meta *lm = &pblk->lm;
struct pblk_line *line, *tline;
LIST_HEAD(list);
spin_lock(&l_mg->close_lock);
if (list_empty(&l_mg->emeta_list)) {
spin_unlock(&l_mg->close_lock);
return;
}
list_cut_position(&list, &l_mg->emeta_list, l_mg->emeta_list.prev);
spin_unlock(&l_mg->close_lock);
list_for_each_entry_safe(line, tline, &list, list) {
struct pblk_emeta *emeta = line->emeta;
while (emeta->mem < lm->emeta_len[0]) {
int ret;
ret = pblk_submit_meta_io(pblk, line);
if (ret) {
pr_err("pblk: sync meta line %d failed (%d)\n",
line->id, ret);
return;
}
}
}
pblk_wait_for_meta(pblk);
flush_workqueue(pblk->close_wq);
}
static void pblk_line_should_sync_meta(struct pblk *pblk)
{
if (pblk_rl_is_limit(&pblk->rl))
@ -1582,15 +1614,13 @@ void pblk_line_close(struct pblk *pblk, struct pblk_line *line)
list_add_tail(&line->list, move_list);
mempool_free(line->map_bitmap, pblk->line_meta_pool);
kfree(line->map_bitmap);
line->map_bitmap = NULL;
line->smeta = NULL;
line->emeta = NULL;
spin_unlock(&line->lock);
spin_unlock(&l_mg->gc_lock);
pblk_gc_should_kick(pblk);
}
void pblk_line_close_meta(struct pblk *pblk, struct pblk_line *line)
@ -1624,43 +1654,16 @@ void pblk_line_close_ws(struct work_struct *work)
struct pblk_line *line = line_ws->line;
pblk_line_close(pblk, line);
mempool_free(line_ws, pblk->line_ws_pool);
mempool_free(line_ws, pblk->gen_ws_pool);
}
void pblk_line_mark_bb(struct work_struct *work)
{
struct pblk_line_ws *line_ws = container_of(work, struct pblk_line_ws,
ws);
struct pblk *pblk = line_ws->pblk;
struct nvm_tgt_dev *dev = pblk->dev;
struct ppa_addr *ppa = line_ws->priv;
int ret;
ret = nvm_set_tgt_bb_tbl(dev, ppa, 1, NVM_BLK_T_GRWN_BAD);
if (ret) {
struct pblk_line *line;
int pos;
line = &pblk->lines[pblk_dev_ppa_to_line(*ppa)];
pos = pblk_dev_ppa_to_pos(&dev->geo, *ppa);
pr_err("pblk: failed to mark bb, line:%d, pos:%d\n",
line->id, pos);
}
kfree(ppa);
mempool_free(line_ws, pblk->line_ws_pool);
}
void pblk_line_run_ws(struct pblk *pblk, struct pblk_line *line, void *priv,
void (*work)(struct work_struct *),
void pblk_gen_run_ws(struct pblk *pblk, struct pblk_line *line, void *priv,
void (*work)(struct work_struct *), gfp_t gfp_mask,
struct workqueue_struct *wq)
{
struct pblk_line_ws *line_ws;
line_ws = mempool_alloc(pblk->line_ws_pool, GFP_ATOMIC);
if (!line_ws)
return;
line_ws = mempool_alloc(pblk->gen_ws_pool, gfp_mask);
line_ws->pblk = pblk;
line_ws->line = line;
@ -1689,16 +1692,8 @@ static void __pblk_down_page(struct pblk *pblk, struct ppa_addr *ppa_list,
#endif
ret = down_timeout(&rlun->wr_sem, msecs_to_jiffies(30000));
if (ret) {
switch (ret) {
case -ETIME:
pr_err("pblk: lun semaphore timed out\n");
break;
case -EINTR:
pr_err("pblk: lun semaphore timed out\n");
break;
}
}
if (ret == -ETIME || ret == -EINTR)
pr_err("pblk: taking lun semaphore timed out: err %d\n", -ret);
}
void pblk_down_page(struct pblk *pblk, struct ppa_addr *ppa_list, int nr_ppas)
@ -1758,13 +1753,11 @@ void pblk_up_rq(struct pblk *pblk, struct ppa_addr *ppa_list, int nr_ppas,
rlun = &pblk->luns[bit];
up(&rlun->wr_sem);
}
kfree(lun_bitmap);
}
void pblk_update_map(struct pblk *pblk, sector_t lba, struct ppa_addr ppa)
{
struct ppa_addr l2p_ppa;
struct ppa_addr ppa_l2p;
/* logic error: lba out-of-bounds. Ignore update */
if (!(lba < pblk->rl.nr_secs)) {
@ -1773,10 +1766,10 @@ void pblk_update_map(struct pblk *pblk, sector_t lba, struct ppa_addr ppa)
}
spin_lock(&pblk->trans_lock);
l2p_ppa = pblk_trans_map_get(pblk, lba);
ppa_l2p = pblk_trans_map_get(pblk, lba);
if (!pblk_addr_in_cache(l2p_ppa) && !pblk_ppa_empty(l2p_ppa))
pblk_map_invalidate(pblk, l2p_ppa);
if (!pblk_addr_in_cache(ppa_l2p) && !pblk_ppa_empty(ppa_l2p))
pblk_map_invalidate(pblk, ppa_l2p);
pblk_trans_map_set(pblk, lba, ppa);
spin_unlock(&pblk->trans_lock);
@ -1784,6 +1777,7 @@ void pblk_update_map(struct pblk *pblk, sector_t lba, struct ppa_addr ppa)
void pblk_update_map_cache(struct pblk *pblk, sector_t lba, struct ppa_addr ppa)
{
#ifdef CONFIG_NVM_DEBUG
/* Callers must ensure that the ppa points to a cache address */
BUG_ON(!pblk_addr_in_cache(ppa));
@ -1793,16 +1787,16 @@ void pblk_update_map_cache(struct pblk *pblk, sector_t lba, struct ppa_addr ppa)
pblk_update_map(pblk, lba, ppa);
}
int pblk_update_map_gc(struct pblk *pblk, sector_t lba, struct ppa_addr ppa,
struct pblk_line *gc_line)
int pblk_update_map_gc(struct pblk *pblk, sector_t lba, struct ppa_addr ppa_new,
struct pblk_line *gc_line, u64 paddr_gc)
{
struct ppa_addr l2p_ppa;
struct ppa_addr ppa_l2p, ppa_gc;
int ret = 1;
#ifdef CONFIG_NVM_DEBUG
/* Callers must ensure that the ppa points to a cache address */
BUG_ON(!pblk_addr_in_cache(ppa));
BUG_ON(pblk_rb_pos_oob(&pblk->rwb, pblk_addr_to_cacheline(ppa)));
BUG_ON(!pblk_addr_in_cache(ppa_new));
BUG_ON(pblk_rb_pos_oob(&pblk->rwb, pblk_addr_to_cacheline(ppa_new)));
#endif
/* logic error: lba out-of-bounds. Ignore update */
@ -1812,36 +1806,41 @@ int pblk_update_map_gc(struct pblk *pblk, sector_t lba, struct ppa_addr ppa,
}
spin_lock(&pblk->trans_lock);
l2p_ppa = pblk_trans_map_get(pblk, lba);
ppa_l2p = pblk_trans_map_get(pblk, lba);
ppa_gc = addr_to_gen_ppa(pblk, paddr_gc, gc_line->id);
if (!pblk_ppa_comp(ppa_l2p, ppa_gc)) {
spin_lock(&gc_line->lock);
WARN(!test_bit(paddr_gc, gc_line->invalid_bitmap),
"pblk: corrupted GC update");
spin_unlock(&gc_line->lock);
/* Prevent updated entries to be overwritten by GC */
if (pblk_addr_in_cache(l2p_ppa) || pblk_ppa_empty(l2p_ppa) ||
pblk_tgt_ppa_to_line(l2p_ppa) != gc_line->id) {
ret = 0;
goto out;
}
pblk_trans_map_set(pblk, lba, ppa);
pblk_trans_map_set(pblk, lba, ppa_new);
out:
spin_unlock(&pblk->trans_lock);
return ret;
}
void pblk_update_map_dev(struct pblk *pblk, sector_t lba, struct ppa_addr ppa,
struct ppa_addr entry_line)
void pblk_update_map_dev(struct pblk *pblk, sector_t lba,
struct ppa_addr ppa_mapped, struct ppa_addr ppa_cache)
{
struct ppa_addr l2p_line;
struct ppa_addr ppa_l2p;
#ifdef CONFIG_NVM_DEBUG
/* Callers must ensure that the ppa points to a device address */
BUG_ON(pblk_addr_in_cache(ppa));
BUG_ON(pblk_addr_in_cache(ppa_mapped));
#endif
/* Invalidate and discard padded entries */
if (lba == ADDR_EMPTY) {
#ifdef CONFIG_NVM_DEBUG
atomic_long_inc(&pblk->padded_wb);
#endif
pblk_map_invalidate(pblk, ppa);
if (!pblk_ppa_empty(ppa_mapped))
pblk_map_invalidate(pblk, ppa_mapped);
return;
}
@ -1852,22 +1851,22 @@ void pblk_update_map_dev(struct pblk *pblk, sector_t lba, struct ppa_addr ppa,
}
spin_lock(&pblk->trans_lock);
l2p_line = pblk_trans_map_get(pblk, lba);
ppa_l2p = pblk_trans_map_get(pblk, lba);
/* Do not update L2P if the cacheline has been updated. In this case,
* the mapped ppa must be invalidated
*/
if (l2p_line.ppa != entry_line.ppa) {
if (!pblk_ppa_empty(ppa))
pblk_map_invalidate(pblk, ppa);
if (!pblk_ppa_comp(ppa_l2p, ppa_cache)) {
if (!pblk_ppa_empty(ppa_mapped))
pblk_map_invalidate(pblk, ppa_mapped);
goto out;
}
#ifdef CONFIG_NVM_DEBUG
WARN_ON(!pblk_addr_in_cache(l2p_line) && !pblk_ppa_empty(l2p_line));
WARN_ON(!pblk_addr_in_cache(ppa_l2p) && !pblk_ppa_empty(ppa_l2p));
#endif
pblk_trans_map_set(pblk, lba, ppa);
pblk_trans_map_set(pblk, lba, ppa_mapped);
out:
spin_unlock(&pblk->trans_lock);
}
@ -1878,23 +1877,32 @@ void pblk_lookup_l2p_seq(struct pblk *pblk, struct ppa_addr *ppas,
int i;
spin_lock(&pblk->trans_lock);
for (i = 0; i < nr_secs; i++)
ppas[i] = pblk_trans_map_get(pblk, blba + i);
for (i = 0; i < nr_secs; i++) {
struct ppa_addr ppa;
ppa = ppas[i] = pblk_trans_map_get(pblk, blba + i);
/* If the L2P entry maps to a line, the reference is valid */
if (!pblk_ppa_empty(ppa) && !pblk_addr_in_cache(ppa)) {
int line_id = pblk_dev_ppa_to_line(ppa);
struct pblk_line *line = &pblk->lines[line_id];
kref_get(&line->ref);
}
}
spin_unlock(&pblk->trans_lock);
}
void pblk_lookup_l2p_rand(struct pblk *pblk, struct ppa_addr *ppas,
u64 *lba_list, int nr_secs)
{
sector_t lba;
u64 lba;
int i;
spin_lock(&pblk->trans_lock);
for (i = 0; i < nr_secs; i++) {
lba = lba_list[i];
if (lba == ADDR_EMPTY) {
ppas[i].ppa = ADDR_EMPTY;
} else {
if (lba != ADDR_EMPTY) {
/* logic error: lba out-of-bounds. Ignore update */
if (!(lba < pblk->rl.nr_secs)) {
WARN(1, "pblk: corrupted L2P map request\n");

@ -20,7 +20,8 @@
static void pblk_gc_free_gc_rq(struct pblk_gc_rq *gc_rq)
{
vfree(gc_rq->data);
if (gc_rq->data)
vfree(gc_rq->data);
kfree(gc_rq);
}
@ -41,10 +42,7 @@ static int pblk_gc_write(struct pblk *pblk)
spin_unlock(&gc->w_lock);
list_for_each_entry_safe(gc_rq, tgc_rq, &w_list, list) {
pblk_write_gc_to_cache(pblk, gc_rq->data, gc_rq->lba_list,
gc_rq->nr_secs, gc_rq->secs_to_gc,
gc_rq->line, PBLK_IOTYPE_GC);
pblk_write_gc_to_cache(pblk, gc_rq);
list_del(&gc_rq->list);
kref_put(&gc_rq->line->ref, pblk_line_put);
pblk_gc_free_gc_rq(gc_rq);
@ -58,64 +56,6 @@ static void pblk_gc_writer_kick(struct pblk_gc *gc)
wake_up_process(gc->gc_writer_ts);
}
/*
* Responsible for managing all memory related to a gc request. Also in case of
* failure
*/
static int pblk_gc_move_valid_secs(struct pblk *pblk, struct pblk_gc_rq *gc_rq)
{
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = &dev->geo;
struct pblk_gc *gc = &pblk->gc;
struct pblk_line *line = gc_rq->line;
void *data;
unsigned int secs_to_gc;
int ret = 0;
data = vmalloc(gc_rq->nr_secs * geo->sec_size);
if (!data) {
ret = -ENOMEM;
goto out;
}
/* Read from GC victim block */
if (pblk_submit_read_gc(pblk, gc_rq->lba_list, data, gc_rq->nr_secs,
&secs_to_gc, line)) {
ret = -EFAULT;
goto free_data;
}
if (!secs_to_gc)
goto free_rq;
gc_rq->data = data;
gc_rq->secs_to_gc = secs_to_gc;
retry:
spin_lock(&gc->w_lock);
if (gc->w_entries >= PBLK_GC_W_QD) {
spin_unlock(&gc->w_lock);
pblk_gc_writer_kick(&pblk->gc);
usleep_range(128, 256);
goto retry;
}
gc->w_entries++;
list_add_tail(&gc_rq->list, &gc->w_list);
spin_unlock(&gc->w_lock);
pblk_gc_writer_kick(&pblk->gc);
return 0;
free_rq:
kfree(gc_rq);
free_data:
vfree(data);
out:
kref_put(&line->ref, pblk_line_put);
return ret;
}
static void pblk_put_line_back(struct pblk *pblk, struct pblk_line *line)
{
struct pblk_line_mgmt *l_mg = &pblk->l_mg;
@ -136,22 +76,57 @@ static void pblk_put_line_back(struct pblk *pblk, struct pblk_line *line)
static void pblk_gc_line_ws(struct work_struct *work)
{
struct pblk_line_ws *line_rq_ws = container_of(work,
struct pblk_line_ws *gc_rq_ws = container_of(work,
struct pblk_line_ws, ws);
struct pblk *pblk = line_rq_ws->pblk;
struct pblk *pblk = gc_rq_ws->pblk;
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = &dev->geo;
struct pblk_gc *gc = &pblk->gc;
struct pblk_line *line = line_rq_ws->line;
struct pblk_gc_rq *gc_rq = line_rq_ws->priv;
struct pblk_line *line = gc_rq_ws->line;
struct pblk_gc_rq *gc_rq = gc_rq_ws->priv;
int ret;
up(&gc->gc_sem);
if (pblk_gc_move_valid_secs(pblk, gc_rq)) {
pr_err("pblk: could not GC all sectors: line:%d (%d/%d)\n",
line->id, *line->vsc,
gc_rq->nr_secs);
gc_rq->data = vmalloc(gc_rq->nr_secs * geo->sec_size);
if (!gc_rq->data) {
pr_err("pblk: could not GC line:%d (%d/%d)\n",
line->id, *line->vsc, gc_rq->nr_secs);
goto out;
}
mempool_free(line_rq_ws, pblk->line_ws_pool);
/* Read from GC victim block */
ret = pblk_submit_read_gc(pblk, gc_rq);
if (ret) {
pr_err("pblk: failed GC read in line:%d (err:%d)\n",
line->id, ret);
goto out;
}
if (!gc_rq->secs_to_gc)
goto out;
retry:
spin_lock(&gc->w_lock);
if (gc->w_entries >= PBLK_GC_RQ_QD) {
spin_unlock(&gc->w_lock);
pblk_gc_writer_kick(&pblk->gc);
usleep_range(128, 256);
goto retry;
}
gc->w_entries++;
list_add_tail(&gc_rq->list, &gc->w_list);
spin_unlock(&gc->w_lock);
pblk_gc_writer_kick(&pblk->gc);
kfree(gc_rq_ws);
return;
out:
pblk_gc_free_gc_rq(gc_rq);
kref_put(&line->ref, pblk_line_put);
kfree(gc_rq_ws);
}
static void pblk_gc_line_prepare_ws(struct work_struct *work)
@ -164,17 +139,24 @@ static void pblk_gc_line_prepare_ws(struct work_struct *work)
struct pblk_line_meta *lm = &pblk->lm;
struct pblk_gc *gc = &pblk->gc;
struct line_emeta *emeta_buf;
struct pblk_line_ws *line_rq_ws;
struct pblk_line_ws *gc_rq_ws;
struct pblk_gc_rq *gc_rq;
__le64 *lba_list;
unsigned long *invalid_bitmap;
int sec_left, nr_secs, bit;
int ret;
invalid_bitmap = kmalloc(lm->sec_bitmap_len, GFP_KERNEL);
if (!invalid_bitmap) {
pr_err("pblk: could not allocate GC invalid bitmap\n");
goto fail_free_ws;
}
emeta_buf = pblk_malloc(lm->emeta_len[0], l_mg->emeta_alloc_type,
GFP_KERNEL);
if (!emeta_buf) {
pr_err("pblk: cannot use GC emeta\n");
return;
goto fail_free_bitmap;
}
ret = pblk_line_read_emeta(pblk, line, emeta_buf);
@ -193,7 +175,11 @@ static void pblk_gc_line_prepare_ws(struct work_struct *work)
goto fail_free_emeta;
}
spin_lock(&line->lock);
bitmap_copy(invalid_bitmap, line->invalid_bitmap, lm->sec_per_line);
sec_left = pblk_line_vsc(line);
spin_unlock(&line->lock);
if (sec_left < 0) {
pr_err("pblk: corrupted GC line (%d)\n", line->id);
goto fail_free_emeta;
@ -207,11 +193,12 @@ next_rq:
nr_secs = 0;
do {
bit = find_next_zero_bit(line->invalid_bitmap, lm->sec_per_line,
bit = find_next_zero_bit(invalid_bitmap, lm->sec_per_line,
bit + 1);
if (bit > line->emeta_ssec)
break;
gc_rq->paddr_list[nr_secs] = bit;
gc_rq->lba_list[nr_secs++] = le64_to_cpu(lba_list[bit]);
} while (nr_secs < pblk->max_write_pgs);
@ -223,19 +210,25 @@ next_rq:
gc_rq->nr_secs = nr_secs;
gc_rq->line = line;
line_rq_ws = mempool_alloc(pblk->line_ws_pool, GFP_KERNEL);
if (!line_rq_ws)
gc_rq_ws = kmalloc(sizeof(struct pblk_line_ws), GFP_KERNEL);
if (!gc_rq_ws)
goto fail_free_gc_rq;
line_rq_ws->pblk = pblk;
line_rq_ws->line = line;
line_rq_ws->priv = gc_rq;
gc_rq_ws->pblk = pblk;
gc_rq_ws->line = line;
gc_rq_ws->priv = gc_rq;
/* The write GC path can be much slower than the read GC one due to
* the budget imposed by the rate-limiter. Balance in case that we get
* back pressure from the write GC path.
*/
while (down_timeout(&gc->gc_sem, msecs_to_jiffies(30000)))
io_schedule();
down(&gc->gc_sem);
kref_get(&line->ref);
INIT_WORK(&line_rq_ws->ws, pblk_gc_line_ws);
queue_work(gc->gc_line_reader_wq, &line_rq_ws->ws);
INIT_WORK(&gc_rq_ws->ws, pblk_gc_line_ws);
queue_work(gc->gc_line_reader_wq, &gc_rq_ws->ws);
sec_left -= nr_secs;
if (sec_left > 0)
@ -243,10 +236,11 @@ next_rq:
out:
pblk_mfree(emeta_buf, l_mg->emeta_alloc_type);
mempool_free(line_ws, pblk->line_ws_pool);
kfree(line_ws);
kfree(invalid_bitmap);
kref_put(&line->ref, pblk_line_put);
atomic_dec(&gc->inflight_gc);
atomic_dec(&gc->read_inflight_gc);
return;
@ -254,10 +248,14 @@ fail_free_gc_rq:
kfree(gc_rq);
fail_free_emeta:
pblk_mfree(emeta_buf, l_mg->emeta_alloc_type);
fail_free_bitmap:
kfree(invalid_bitmap);
fail_free_ws:
kfree(line_ws);
pblk_put_line_back(pblk, line);
kref_put(&line->ref, pblk_line_put);
mempool_free(line_ws, pblk->line_ws_pool);
atomic_dec(&gc->inflight_gc);
atomic_dec(&gc->read_inflight_gc);
pr_err("pblk: Failed to GC line %d\n", line->id);
}
@ -269,19 +267,40 @@ static int pblk_gc_line(struct pblk *pblk, struct pblk_line *line)
pr_debug("pblk: line '%d' being reclaimed for GC\n", line->id);
line_ws = mempool_alloc(pblk->line_ws_pool, GFP_KERNEL);
line_ws = kmalloc(sizeof(struct pblk_line_ws), GFP_KERNEL);
if (!line_ws)
return -ENOMEM;
line_ws->pblk = pblk;
line_ws->line = line;
atomic_inc(&gc->pipeline_gc);
INIT_WORK(&line_ws->ws, pblk_gc_line_prepare_ws);
queue_work(gc->gc_reader_wq, &line_ws->ws);
return 0;
}
static void pblk_gc_reader_kick(struct pblk_gc *gc)
{
wake_up_process(gc->gc_reader_ts);
}
static void pblk_gc_kick(struct pblk *pblk)
{
struct pblk_gc *gc = &pblk->gc;
pblk_gc_writer_kick(gc);
pblk_gc_reader_kick(gc);
/* If we're shutting down GC, let's not start it up again */
if (gc->gc_enabled) {
wake_up_process(gc->gc_ts);
mod_timer(&gc->gc_timer,
jiffies + msecs_to_jiffies(GC_TIME_MSECS));
}
}
static int pblk_gc_read(struct pblk *pblk)
{
struct pblk_gc *gc = &pblk->gc;
@ -305,11 +324,6 @@ static int pblk_gc_read(struct pblk *pblk)
return 0;
}
static void pblk_gc_reader_kick(struct pblk_gc *gc)
{
wake_up_process(gc->gc_reader_ts);
}
static struct pblk_line *pblk_gc_get_victim_line(struct pblk *pblk,
struct list_head *group_list)
{
@ -338,26 +352,17 @@ static bool pblk_gc_should_run(struct pblk_gc *gc, struct pblk_rl *rl)
return ((gc->gc_active) && (nr_blocks_need > nr_blocks_free));
}
/*
* Lines with no valid sectors will be returned to the free list immediately. If
* GC is activated - either because the free block count is under the determined
* threshold, or because it is being forced from user space - only lines with a
* high count of invalid sectors will be recycled.
*/
static void pblk_gc_run(struct pblk *pblk)
void pblk_gc_free_full_lines(struct pblk *pblk)
{
struct pblk_line_mgmt *l_mg = &pblk->l_mg;
struct pblk_gc *gc = &pblk->gc;
struct pblk_line *line;
struct list_head *group_list;
bool run_gc;
int inflight_gc, gc_group = 0, prev_group = 0;
do {
spin_lock(&l_mg->gc_lock);
if (list_empty(&l_mg->gc_full_list)) {
spin_unlock(&l_mg->gc_lock);
break;
return;
}
line = list_first_entry(&l_mg->gc_full_list,
@ -371,11 +376,30 @@ static void pblk_gc_run(struct pblk *pblk)
list_del(&line->list);
spin_unlock(&l_mg->gc_lock);
atomic_inc(&gc->pipeline_gc);
kref_put(&line->ref, pblk_line_put);
} while (1);
}
/*
* Lines with no valid sectors will be returned to the free list immediately. If
* GC is activated - either because the free block count is under the determined
* threshold, or because it is being forced from user space - only lines with a
* high count of invalid sectors will be recycled.
*/
static void pblk_gc_run(struct pblk *pblk)
{
struct pblk_line_mgmt *l_mg = &pblk->l_mg;
struct pblk_gc *gc = &pblk->gc;
struct pblk_line *line;
struct list_head *group_list;
bool run_gc;
int read_inflight_gc, gc_group = 0, prev_group = 0;
pblk_gc_free_full_lines(pblk);
run_gc = pblk_gc_should_run(&pblk->gc, &pblk->rl);
if (!run_gc || (atomic_read(&gc->inflight_gc) >= PBLK_GC_L_QD))
if (!run_gc || (atomic_read(&gc->read_inflight_gc) >= PBLK_GC_L_QD))
return;
next_gc_group:
@ -402,14 +426,14 @@ next_gc_group:
list_add_tail(&line->list, &gc->r_list);
spin_unlock(&gc->r_lock);
inflight_gc = atomic_inc_return(&gc->inflight_gc);
read_inflight_gc = atomic_inc_return(&gc->read_inflight_gc);
pblk_gc_reader_kick(gc);
prev_group = 1;
/* No need to queue up more GC lines than we can handle */
run_gc = pblk_gc_should_run(&pblk->gc, &pblk->rl);
if (!run_gc || inflight_gc >= PBLK_GC_L_QD)
if (!run_gc || read_inflight_gc >= PBLK_GC_L_QD)
break;
} while (1);
@ -418,16 +442,6 @@ next_gc_group:
goto next_gc_group;
}
void pblk_gc_kick(struct pblk *pblk)
{
struct pblk_gc *gc = &pblk->gc;
wake_up_process(gc->gc_ts);
pblk_gc_writer_kick(gc);
pblk_gc_reader_kick(gc);
mod_timer(&gc->gc_timer, jiffies + msecs_to_jiffies(GC_TIME_MSECS));
}
static void pblk_gc_timer(unsigned long data)
{
struct pblk *pblk = (struct pblk *)data;
@ -465,6 +479,7 @@ static int pblk_gc_writer_ts(void *data)
static int pblk_gc_reader_ts(void *data)
{
struct pblk *pblk = data;
struct pblk_gc *gc = &pblk->gc;
while (!kthread_should_stop()) {
if (!pblk_gc_read(pblk))
@ -473,6 +488,18 @@ static int pblk_gc_reader_ts(void *data)
io_schedule();
}
#ifdef CONFIG_NVM_DEBUG
pr_info("pblk: flushing gc pipeline, %d lines left\n",
atomic_read(&gc->pipeline_gc));
#endif
do {
if (!atomic_read(&gc->pipeline_gc))
break;
schedule();
} while (1);
return 0;
}
@ -486,10 +513,10 @@ void pblk_gc_should_start(struct pblk *pblk)
{
struct pblk_gc *gc = &pblk->gc;
if (gc->gc_enabled && !gc->gc_active)
if (gc->gc_enabled && !gc->gc_active) {
pblk_gc_start(pblk);
pblk_gc_kick(pblk);
pblk_gc_kick(pblk);
}
}
/*
@ -510,6 +537,11 @@ void pblk_gc_should_stop(struct pblk *pblk)
pblk_gc_stop(pblk, 0);
}
void pblk_gc_should_kick(struct pblk *pblk)
{
pblk_rl_update_rates(&pblk->rl);
}
void pblk_gc_sysfs_state_show(struct pblk *pblk, int *gc_enabled,
int *gc_active)
{
@ -576,7 +608,8 @@ int pblk_gc_init(struct pblk *pblk)
gc->gc_forced = 0;
gc->gc_enabled = 1;
gc->w_entries = 0;
atomic_set(&gc->inflight_gc, 0);
atomic_set(&gc->read_inflight_gc, 0);
atomic_set(&gc->pipeline_gc, 0);
/* Workqueue that reads valid sectors from a line and submit them to the
* GC writer to be recycled.
@ -602,7 +635,7 @@ int pblk_gc_init(struct pblk *pblk)
spin_lock_init(&gc->w_lock);
spin_lock_init(&gc->r_lock);
sema_init(&gc->gc_sem, 128);
sema_init(&gc->gc_sem, PBLK_GC_RQ_QD);
INIT_LIST_HEAD(&gc->w_list);
INIT_LIST_HEAD(&gc->r_list);
@ -625,24 +658,24 @@ void pblk_gc_exit(struct pblk *pblk)
{
struct pblk_gc *gc = &pblk->gc;
flush_workqueue(gc->gc_reader_wq);
flush_workqueue(gc->gc_line_reader_wq);
del_timer(&gc->gc_timer);
gc->gc_enabled = 0;
del_timer_sync(&gc->gc_timer);
pblk_gc_stop(pblk, 1);
if (gc->gc_ts)
kthread_stop(gc->gc_ts);
if (gc->gc_reader_ts)
kthread_stop(gc->gc_reader_ts);
flush_workqueue(gc->gc_reader_wq);
if (gc->gc_reader_wq)
destroy_workqueue(gc->gc_reader_wq);
flush_workqueue(gc->gc_line_reader_wq);
if (gc->gc_line_reader_wq)
destroy_workqueue(gc->gc_line_reader_wq);
if (gc->gc_writer_ts)
kthread_stop(gc->gc_writer_ts);
if (gc->gc_reader_ts)
kthread_stop(gc->gc_reader_ts);
}

@ -20,8 +20,8 @@
#include "pblk.h"
static struct kmem_cache *pblk_blk_ws_cache, *pblk_rec_cache, *pblk_g_rq_cache,
*pblk_w_rq_cache, *pblk_line_meta_cache;
static struct kmem_cache *pblk_ws_cache, *pblk_rec_cache, *pblk_g_rq_cache,
*pblk_w_rq_cache;
static DECLARE_RWSEM(pblk_lock);
struct bio_set *pblk_bio_set;
@ -46,7 +46,7 @@ static int pblk_rw_io(struct request_queue *q, struct pblk *pblk,
* user I/Os. Unless stalled, the rate limiter leaves at least 256KB
* available for user I/O.
*/
if (unlikely(pblk_get_secs(bio) >= pblk_rl_sysfs_rate_show(&pblk->rl)))
if (pblk_get_secs(bio) > pblk_rl_max_io(&pblk->rl))
blk_queue_split(q, &bio);
return pblk_write_to_cache(pblk, bio, PBLK_IOTYPE_USER);
@ -76,6 +76,28 @@ static blk_qc_t pblk_make_rq(struct request_queue *q, struct bio *bio)
return BLK_QC_T_NONE;
}
static size_t pblk_trans_map_size(struct pblk *pblk)
{
int entry_size = 8;
if (pblk->ppaf_bitsize < 32)
entry_size = 4;
return entry_size * pblk->rl.nr_secs;
}
#ifdef CONFIG_NVM_DEBUG
static u32 pblk_l2p_crc(struct pblk *pblk)
{
size_t map_size;
u32 crc = ~(u32)0;
map_size = pblk_trans_map_size(pblk);
crc = crc32_le(crc, pblk->trans_map, map_size);
return crc;
}
#endif
static void pblk_l2p_free(struct pblk *pblk)
{
vfree(pblk->trans_map);
@ -85,12 +107,10 @@ static int pblk_l2p_init(struct pblk *pblk)
{
sector_t i;
struct ppa_addr ppa;
int entry_size = 8;
size_t map_size;
if (pblk->ppaf_bitsize < 32)
entry_size = 4;
pblk->trans_map = vmalloc(entry_size * pblk->rl.nr_secs);
map_size = pblk_trans_map_size(pblk);
pblk->trans_map = vmalloc(map_size);
if (!pblk->trans_map)
return -ENOMEM;
@ -132,7 +152,6 @@ static int pblk_rwb_init(struct pblk *pblk)
}
/* Minimum pages needed within a lun */
#define PAGE_POOL_SIZE 16
#define ADDR_POOL_SIZE 64
static int pblk_set_ppaf(struct pblk *pblk)
@ -182,12 +201,10 @@ static int pblk_set_ppaf(struct pblk *pblk)
static int pblk_init_global_caches(struct pblk *pblk)
{
char cache_name[PBLK_CACHE_NAME_LEN];
down_write(&pblk_lock);
pblk_blk_ws_cache = kmem_cache_create("pblk_blk_ws",
pblk_ws_cache = kmem_cache_create("pblk_blk_ws",
sizeof(struct pblk_line_ws), 0, 0, NULL);
if (!pblk_blk_ws_cache) {
if (!pblk_ws_cache) {
up_write(&pblk_lock);
return -ENOMEM;
}
@ -195,7 +212,7 @@ static int pblk_init_global_caches(struct pblk *pblk)
pblk_rec_cache = kmem_cache_create("pblk_rec",
sizeof(struct pblk_rec_ctx), 0, 0, NULL);
if (!pblk_rec_cache) {
kmem_cache_destroy(pblk_blk_ws_cache);
kmem_cache_destroy(pblk_ws_cache);
up_write(&pblk_lock);
return -ENOMEM;
}
@ -203,7 +220,7 @@ static int pblk_init_global_caches(struct pblk *pblk)
pblk_g_rq_cache = kmem_cache_create("pblk_g_rq", pblk_g_rq_size,
0, 0, NULL);
if (!pblk_g_rq_cache) {
kmem_cache_destroy(pblk_blk_ws_cache);
kmem_cache_destroy(pblk_ws_cache);
kmem_cache_destroy(pblk_rec_cache);
up_write(&pblk_lock);
return -ENOMEM;
@ -212,30 +229,25 @@ static int pblk_init_global_caches(struct pblk *pblk)
pblk_w_rq_cache = kmem_cache_create("pblk_w_rq", pblk_w_rq_size,
0, 0, NULL);
if (!pblk_w_rq_cache) {
kmem_cache_destroy(pblk_blk_ws_cache);
kmem_cache_destroy(pblk_ws_cache);
kmem_cache_destroy(pblk_rec_cache);
kmem_cache_destroy(pblk_g_rq_cache);
up_write(&pblk_lock);
return -ENOMEM;
}
snprintf(cache_name, sizeof(cache_name), "pblk_line_m_%s",
pblk->disk->disk_name);
pblk_line_meta_cache = kmem_cache_create(cache_name,
pblk->lm.sec_bitmap_len, 0, 0, NULL);
if (!pblk_line_meta_cache) {
kmem_cache_destroy(pblk_blk_ws_cache);
kmem_cache_destroy(pblk_rec_cache);
kmem_cache_destroy(pblk_g_rq_cache);
kmem_cache_destroy(pblk_w_rq_cache);
up_write(&pblk_lock);
return -ENOMEM;
}
up_write(&pblk_lock);
return 0;
}
static void pblk_free_global_caches(struct pblk *pblk)
{
kmem_cache_destroy(pblk_ws_cache);
kmem_cache_destroy(pblk_rec_cache);
kmem_cache_destroy(pblk_g_rq_cache);
kmem_cache_destroy(pblk_w_rq_cache);
}
static int pblk_core_init(struct pblk *pblk)
{
struct nvm_tgt_dev *dev = pblk->dev;
@ -247,70 +259,80 @@ static int pblk_core_init(struct pblk *pblk)
if (pblk_init_global_caches(pblk))
return -ENOMEM;
pblk->page_pool = mempool_create_page_pool(PAGE_POOL_SIZE, 0);
if (!pblk->page_pool)
return -ENOMEM;
/* Internal bios can be at most the sectors signaled by the device. */
pblk->page_bio_pool = mempool_create_page_pool(nvm_max_phys_sects(dev),
0);
if (!pblk->page_bio_pool)
goto free_global_caches;
pblk->line_ws_pool = mempool_create_slab_pool(PBLK_WS_POOL_SIZE,
pblk_blk_ws_cache);
if (!pblk->line_ws_pool)
goto free_page_pool;
pblk->gen_ws_pool = mempool_create_slab_pool(PBLK_GEN_WS_POOL_SIZE,
pblk_ws_cache);
if (!pblk->gen_ws_pool)
goto free_page_bio_pool;
pblk->rec_pool = mempool_create_slab_pool(geo->nr_luns, pblk_rec_cache);
if (!pblk->rec_pool)
goto free_blk_ws_pool;
goto free_gen_ws_pool;
pblk->g_rq_pool = mempool_create_slab_pool(PBLK_READ_REQ_POOL_SIZE,
pblk->r_rq_pool = mempool_create_slab_pool(geo->nr_luns,
pblk_g_rq_cache);
if (!pblk->g_rq_pool)
if (!pblk->r_rq_pool)
goto free_rec_pool;
pblk->w_rq_pool = mempool_create_slab_pool(geo->nr_luns * 2,
pblk->e_rq_pool = mempool_create_slab_pool(geo->nr_luns,
pblk_g_rq_cache);
if (!pblk->e_rq_pool)
goto free_r_rq_pool;
pblk->w_rq_pool = mempool_create_slab_pool(geo->nr_luns,
pblk_w_rq_cache);
if (!pblk->w_rq_pool)
goto free_g_rq_pool;
pblk->line_meta_pool =
mempool_create_slab_pool(PBLK_META_POOL_SIZE,
pblk_line_meta_cache);
if (!pblk->line_meta_pool)
goto free_w_rq_pool;
goto free_e_rq_pool;
pblk->close_wq = alloc_workqueue("pblk-close-wq",
WQ_MEM_RECLAIM | WQ_UNBOUND, PBLK_NR_CLOSE_JOBS);
if (!pblk->close_wq)
goto free_line_meta_pool;
goto free_w_rq_pool;
pblk->bb_wq = alloc_workqueue("pblk-bb-wq",
WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
if (!pblk->bb_wq)
goto free_close_wq;
if (pblk_set_ppaf(pblk))
pblk->r_end_wq = alloc_workqueue("pblk-read-end-wq",
WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
if (!pblk->r_end_wq)
goto free_bb_wq;
if (pblk_set_ppaf(pblk))
goto free_r_end_wq;
if (pblk_rwb_init(pblk))
goto free_bb_wq;
goto free_r_end_wq;
INIT_LIST_HEAD(&pblk->compl_list);
return 0;
free_r_end_wq:
destroy_workqueue(pblk->r_end_wq);
free_bb_wq:
destroy_workqueue(pblk->bb_wq);
free_close_wq:
destroy_workqueue(pblk->close_wq);
free_line_meta_pool:
mempool_destroy(pblk->line_meta_pool);
free_w_rq_pool:
mempool_destroy(pblk->w_rq_pool);
free_g_rq_pool:
mempool_destroy(pblk->g_rq_pool);
free_e_rq_pool:
mempool_destroy(pblk->e_rq_pool);
free_r_rq_pool:
mempool_destroy(pblk->r_rq_pool);
free_rec_pool:
mempool_destroy(pblk->rec_pool);
free_blk_ws_pool:
mempool_destroy(pblk->line_ws_pool);
free_page_pool:
mempool_destroy(pblk->page_pool);
free_gen_ws_pool:
mempool_destroy(pblk->gen_ws_pool);
free_page_bio_pool:
mempool_destroy(pblk->page_bio_pool);
free_global_caches:
pblk_free_global_caches(pblk);
return -ENOMEM;
}
@ -319,21 +341,20 @@ static void pblk_core_free(struct pblk *pblk)
if (pblk->close_wq)
destroy_workqueue(pblk->close_wq);
if (pblk->r_end_wq)
destroy_workqueue(pblk->r_end_wq);
if (pblk->bb_wq)
destroy_workqueue(pblk->bb_wq);
mempool_destroy(pblk->page_pool);
mempool_destroy(pblk->line_ws_pool);
mempool_destroy(pblk->page_bio_pool);
mempool_destroy(pblk->gen_ws_pool);
mempool_destroy(pblk->rec_pool);
mempool_destroy(pblk->g_rq_pool);
mempool_destroy(pblk->r_rq_pool);
mempool_destroy(pblk->e_rq_pool);
mempool_destroy(pblk->w_rq_pool);
mempool_destroy(pblk->line_meta_pool);
kmem_cache_destroy(pblk_blk_ws_cache);
kmem_cache_destroy(pblk_rec_cache);
kmem_cache_destroy(pblk_g_rq_cache);
kmem_cache_destroy(pblk_w_rq_cache);
kmem_cache_destroy(pblk_line_meta_cache);
pblk_free_global_caches(pblk);
}
static void pblk_luns_free(struct pblk *pblk)
@ -372,13 +393,11 @@ static void pblk_line_meta_free(struct pblk *pblk)
kfree(l_mg->bb_aux);
kfree(l_mg->vsc_list);
spin_lock(&l_mg->free_lock);
for (i = 0; i < PBLK_DATA_LINES; i++) {
kfree(l_mg->sline_meta[i]);
pblk_mfree(l_mg->eline_meta[i]->buf, l_mg->emeta_alloc_type);
kfree(l_mg->eline_meta[i]);
}
spin_unlock(&l_mg->free_lock);
kfree(pblk->lines);
}
@ -507,6 +526,13 @@ static int pblk_lines_configure(struct pblk *pblk, int flags)
}
}
#ifdef CONFIG_NVM_DEBUG
pr_info("pblk init: L2P CRC: %x\n", pblk_l2p_crc(pblk));
#endif
/* Free full lines directly as GC has not been started yet */
pblk_gc_free_full_lines(pblk);
if (!line) {
/* Configure next line for user data */
line = pblk_line_get_first_data(pblk);
@ -630,7 +656,10 @@ static int pblk_lines_alloc_metadata(struct pblk *pblk)
fail_free_emeta:
while (--i >= 0) {
vfree(l_mg->eline_meta[i]->buf);
if (l_mg->emeta_alloc_type == PBLK_VMALLOC_META)
vfree(l_mg->eline_meta[i]->buf);
else
kfree(l_mg->eline_meta[i]->buf);
kfree(l_mg->eline_meta[i]);
}
@ -681,8 +710,8 @@ static int pblk_lines_init(struct pblk *pblk)
lm->blk_bitmap_len = BITS_TO_LONGS(geo->nr_luns) * sizeof(long);
lm->sec_bitmap_len = BITS_TO_LONGS(lm->sec_per_line) * sizeof(long);
lm->lun_bitmap_len = BITS_TO_LONGS(geo->nr_luns) * sizeof(long);
lm->high_thrs = lm->sec_per_line / 2;
lm->mid_thrs = lm->sec_per_line / 4;
lm->mid_thrs = lm->sec_per_line / 2;
lm->high_thrs = lm->sec_per_line / 4;
lm->meta_distance = (geo->nr_luns / 2) * pblk->min_write_pgs;
/* Calculate necessary pages for smeta. See comment over struct
@ -713,9 +742,13 @@ add_emeta_page:
goto add_emeta_page;
}
lm->emeta_bb = geo->nr_luns - i;
lm->min_blk_line = 1 + DIV_ROUND_UP(lm->smeta_sec + lm->emeta_sec[0],
geo->sec_per_blk);
lm->emeta_bb = geo->nr_luns > i ? geo->nr_luns - i : 0;
lm->min_blk_line = 1;
if (geo->nr_luns > 1)
lm->min_blk_line += DIV_ROUND_UP(lm->smeta_sec +
lm->emeta_sec[0], geo->sec_per_blk);
if (lm->min_blk_line > lm->blk_per_line) {
pr_err("pblk: config. not supported. Min. LUN in line:%d\n",
lm->blk_per_line);
@ -890,6 +923,11 @@ static void pblk_exit(void *private)
down_write(&pblk_lock);
pblk_gc_exit(pblk);
pblk_tear_down(pblk);
#ifdef CONFIG_NVM_DEBUG
pr_info("pblk exit: L2P CRC: %x\n", pblk_l2p_crc(pblk));
#endif
pblk_free(pblk);
up_write(&pblk_lock);
}
@ -911,7 +949,7 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk,
int ret;
if (dev->identity.dom & NVM_RSP_L2P) {
pr_err("pblk: device-side L2P table not supported. (%x)\n",
pr_err("pblk: host-side L2P table not supported. (%x)\n",
dev->identity.dom);
return ERR_PTR(-EINVAL);
}
@ -923,6 +961,7 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk,
pblk->dev = dev;
pblk->disk = tdisk;
pblk->state = PBLK_STATE_RUNNING;
pblk->gc.gc_enabled = 0;
spin_lock_init(&pblk->trans_lock);
spin_lock_init(&pblk->lock);
@ -944,6 +983,7 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk,
atomic_long_set(&pblk->recov_writes, 0);
atomic_long_set(&pblk->recov_writes, 0);
atomic_long_set(&pblk->recov_gc_writes, 0);
atomic_long_set(&pblk->recov_gc_reads, 0);
#endif
atomic_long_set(&pblk->read_failed, 0);
@ -1012,6 +1052,10 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk,
pblk->rwb.nr_entries);
wake_up_process(pblk->writer_ts);
/* Check if we need to start GC */
pblk_gc_should_kick(pblk);
return pblk;
fail_stop_writer:
@ -1044,6 +1088,7 @@ static struct nvm_tgt_type tt_pblk = {
.sysfs_init = pblk_sysfs_init,
.sysfs_exit = pblk_sysfs_exit,
.owner = THIS_MODULE,
};
static int __init pblk_module_init(void)

@ -25,16 +25,28 @@ static void pblk_map_page_data(struct pblk *pblk, unsigned int sentry,
unsigned int valid_secs)
{
struct pblk_line *line = pblk_line_get_data(pblk);
struct pblk_emeta *emeta = line->emeta;
struct pblk_emeta *emeta;
struct pblk_w_ctx *w_ctx;
__le64 *lba_list = emeta_to_lbas(pblk, emeta->buf);
__le64 *lba_list;
u64 paddr;
int nr_secs = pblk->min_write_pgs;
int i;
if (pblk_line_is_full(line)) {
struct pblk_line *prev_line = line;
line = pblk_line_replace_data(pblk);
pblk_line_close_meta(pblk, prev_line);
}
emeta = line->emeta;
lba_list = emeta_to_lbas(pblk, emeta->buf);
paddr = pblk_alloc_page(pblk, line, nr_secs);
for (i = 0; i < nr_secs; i++, paddr++) {
__le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
/* ppa to be sent to the device */
ppa_list[i] = addr_to_gen_ppa(pblk, paddr, line->id);
@ -51,22 +63,14 @@ static void pblk_map_page_data(struct pblk *pblk, unsigned int sentry,
w_ctx->ppa = ppa_list[i];
meta_list[i].lba = cpu_to_le64(w_ctx->lba);
lba_list[paddr] = cpu_to_le64(w_ctx->lba);
line->nr_valid_lbas++;
if (lba_list[paddr] != addr_empty)
line->nr_valid_lbas++;
} else {
__le64 addr_empty = cpu_to_le64(ADDR_EMPTY);
lba_list[paddr] = meta_list[i].lba = addr_empty;
__pblk_map_invalidate(pblk, line, paddr);
}
}
if (pblk_line_is_full(line)) {
struct pblk_line *prev_line = line;
pblk_line_replace_data(pblk);
pblk_line_close_meta(pblk, prev_line);
}
pblk_down_rq(pblk, ppa_list, nr_secs, lun_bitmap);
}

@ -201,8 +201,7 @@ unsigned int pblk_rb_read_commit(struct pblk_rb *rb, unsigned int nr_entries)
return subm;
}
static int __pblk_rb_update_l2p(struct pblk_rb *rb, unsigned int *l2p_upd,
unsigned int to_update)
static int __pblk_rb_update_l2p(struct pblk_rb *rb, unsigned int to_update)
{
struct pblk *pblk = container_of(rb, struct pblk, rwb);
struct pblk_line *line;
@ -213,7 +212,7 @@ static int __pblk_rb_update_l2p(struct pblk_rb *rb, unsigned int *l2p_upd,
int flags;
for (i = 0; i < to_update; i++) {
entry = &rb->entries[*l2p_upd];
entry = &rb->entries[rb->l2p_update];
w_ctx = &entry->w_ctx;
flags = READ_ONCE(entry->w_ctx.flags);
@ -230,7 +229,7 @@ static int __pblk_rb_update_l2p(struct pblk_rb *rb, unsigned int *l2p_upd,
line = &pblk->lines[pblk_tgt_ppa_to_line(w_ctx->ppa)];
kref_put(&line->ref, pblk_line_put);
clean_wctx(w_ctx);
*l2p_upd = (*l2p_upd + 1) & (rb->nr_entries - 1);
rb->l2p_update = (rb->l2p_update + 1) & (rb->nr_entries - 1);
}
pblk_rl_out(&pblk->rl, user_io, gc_io);
@ -258,7 +257,7 @@ static int pblk_rb_update_l2p(struct pblk_rb *rb, unsigned int nr_entries,
count = nr_entries - space;
/* l2p_update used exclusively under rb->w_lock */
ret = __pblk_rb_update_l2p(rb, &rb->l2p_update, count);
ret = __pblk_rb_update_l2p(rb, count);
out:
return ret;
@ -280,7 +279,7 @@ void pblk_rb_sync_l2p(struct pblk_rb *rb)
sync = smp_load_acquire(&rb->sync);
to_update = pblk_rb_ring_count(sync, rb->l2p_update, rb->nr_entries);
__pblk_rb_update_l2p(rb, &rb->l2p_update, to_update);
__pblk_rb_update_l2p(rb, to_update);
spin_unlock(&rb->w_lock);
}
@ -325,8 +324,8 @@ void pblk_rb_write_entry_user(struct pblk_rb *rb, void *data,
}
void pblk_rb_write_entry_gc(struct pblk_rb *rb, void *data,
struct pblk_w_ctx w_ctx, struct pblk_line *gc_line,
unsigned int ring_pos)
struct pblk_w_ctx w_ctx, struct pblk_line *line,
u64 paddr, unsigned int ring_pos)
{
struct pblk *pblk = container_of(rb, struct pblk, rwb);
struct pblk_rb_entry *entry;
@ -341,7 +340,7 @@ void pblk_rb_write_entry_gc(struct pblk_rb *rb, void *data,
__pblk_rb_write_entry(rb, data, w_ctx, entry);
if (!pblk_update_map_gc(pblk, w_ctx.lba, entry->cacheline, gc_line))
if (!pblk_update_map_gc(pblk, w_ctx.lba, entry->cacheline, line, paddr))
entry->w_ctx.lba = ADDR_EMPTY;
flags = w_ctx.flags | PBLK_WRITTEN_DATA;
@ -355,7 +354,6 @@ static int pblk_rb_sync_point_set(struct pblk_rb *rb, struct bio *bio,
{
struct pblk_rb_entry *entry;
unsigned int subm, sync_point;
int flags;
subm = READ_ONCE(rb->subm);
@ -369,12 +367,6 @@ static int pblk_rb_sync_point_set(struct pblk_rb *rb, struct bio *bio,
sync_point = (pos == 0) ? (rb->nr_entries - 1) : (pos - 1);
entry = &rb->entries[sync_point];
flags = READ_ONCE(entry->w_ctx.flags);
flags |= PBLK_FLUSH_ENTRY;
/* Release flags on context. Protect from writes */
smp_store_release(&entry->w_ctx.flags, flags);
/* Protect syncs */
smp_store_release(&rb->sync_point, sync_point);
@ -454,6 +446,7 @@ static int pblk_rb_may_write_flush(struct pblk_rb *rb, unsigned int nr_entries,
/* Protect from read count */
smp_store_release(&rb->mem, mem);
return 1;
}
@ -558,12 +551,13 @@ out:
* persist data on the write buffer to the media.
*/
unsigned int pblk_rb_read_to_bio(struct pblk_rb *rb, struct nvm_rq *rqd,
struct bio *bio, unsigned int pos,
unsigned int nr_entries, unsigned int count)
unsigned int pos, unsigned int nr_entries,
unsigned int count)
{
struct pblk *pblk = container_of(rb, struct pblk, rwb);
struct request_queue *q = pblk->dev->q;
struct pblk_c_ctx *c_ctx = nvm_rq_to_pdu(rqd);
struct bio *bio = rqd->bio;
struct pblk_rb_entry *entry;
struct page *page;
unsigned int pad = 0, to_read = nr_entries;

@ -39,21 +39,15 @@ static int pblk_read_from_cache(struct pblk *pblk, struct bio *bio,
}
static void pblk_read_ppalist_rq(struct pblk *pblk, struct nvm_rq *rqd,
unsigned long *read_bitmap)
sector_t blba, unsigned long *read_bitmap)
{
struct pblk_sec_meta *meta_list = rqd->meta_list;
struct bio *bio = rqd->bio;
struct ppa_addr ppas[PBLK_MAX_REQ_ADDRS];
sector_t blba = pblk_get_lba(bio);
int nr_secs = rqd->nr_ppas;
bool advanced_bio = false;
int i, j = 0;
/* logic error: lba out-of-bounds. Ignore read request */
if (blba + nr_secs >= pblk->rl.nr_secs) {
WARN(1, "pblk: read lbas out of bounds\n");
return;
}
pblk_lookup_l2p_seq(pblk, ppas, blba, nr_secs);
for (i = 0; i < nr_secs; i++) {
@ -63,6 +57,7 @@ static void pblk_read_ppalist_rq(struct pblk *pblk, struct nvm_rq *rqd,
retry:
if (pblk_ppa_empty(p)) {
WARN_ON(test_and_set_bit(i, read_bitmap));
meta_list[i].lba = cpu_to_le64(ADDR_EMPTY);
if (unlikely(!advanced_bio)) {
bio_advance(bio, (i) * PBLK_EXPOSED_PAGE_SIZE);
@ -82,6 +77,7 @@ retry:
goto retry;
}
WARN_ON(test_and_set_bit(i, read_bitmap));
meta_list[i].lba = cpu_to_le64(lba);
advanced_bio = true;
#ifdef CONFIG_NVM_DEBUG
atomic_long_inc(&pblk->cache_reads);
@ -117,10 +113,51 @@ static int pblk_submit_read_io(struct pblk *pblk, struct nvm_rq *rqd)
return NVM_IO_OK;
}
static void pblk_end_io_read(struct nvm_rq *rqd)
static void pblk_read_check(struct pblk *pblk, struct nvm_rq *rqd,
sector_t blba)
{
struct pblk_sec_meta *meta_list = rqd->meta_list;
int nr_lbas = rqd->nr_ppas;
int i;
for (i = 0; i < nr_lbas; i++) {
u64 lba = le64_to_cpu(meta_list[i].lba);
if (lba == ADDR_EMPTY)
continue;
WARN(lba != blba + i, "pblk: corrupted read LBA\n");
}
}
static void pblk_read_put_rqd_kref(struct pblk *pblk, struct nvm_rq *rqd)
{
struct ppa_addr *ppa_list;
int i;
ppa_list = (rqd->nr_ppas > 1) ? rqd->ppa_list : &rqd->ppa_addr;
for (i = 0; i < rqd->nr_ppas; i++) {
struct ppa_addr ppa = ppa_list[i];
struct pblk_line *line;
line = &pblk->lines[pblk_dev_ppa_to_line(ppa)];
kref_put(&line->ref, pblk_line_put_wq);
}
}
static void pblk_end_user_read(struct bio *bio)
{
#ifdef CONFIG_NVM_DEBUG
WARN_ONCE(bio->bi_status, "pblk: corrupted read bio\n");
#endif
bio_endio(bio);
bio_put(bio);
}
static void __pblk_end_io_read(struct pblk *pblk, struct nvm_rq *rqd,
bool put_line)
{
struct pblk *pblk = rqd->private;
struct nvm_tgt_dev *dev = pblk->dev;
struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
struct bio *bio = rqd->bio;
@ -131,47 +168,51 @@ static void pblk_end_io_read(struct nvm_rq *rqd)
WARN_ONCE(bio->bi_status, "pblk: corrupted read error\n");
#endif
nvm_dev_dma_free(dev->parent, rqd->meta_list, rqd->dma_meta_list);
pblk_read_check(pblk, rqd, r_ctx->lba);
bio_put(bio);
if (r_ctx->private) {
struct bio *orig_bio = r_ctx->private;
if (r_ctx->private)
pblk_end_user_read((struct bio *)r_ctx->private);
#ifdef CONFIG_NVM_DEBUG
WARN_ONCE(orig_bio->bi_status, "pblk: corrupted read bio\n");
#endif
bio_endio(orig_bio);
bio_put(orig_bio);
}
if (put_line)
pblk_read_put_rqd_kref(pblk, rqd);
#ifdef CONFIG_NVM_DEBUG
atomic_long_add(rqd->nr_ppas, &pblk->sync_reads);
atomic_long_sub(rqd->nr_ppas, &pblk->inflight_reads);
#endif
pblk_free_rqd(pblk, rqd, READ);
pblk_free_rqd(pblk, rqd, PBLK_READ);
atomic_dec(&pblk->inflight_io);
}
static void pblk_end_io_read(struct nvm_rq *rqd)
{
struct pblk *pblk = rqd->private;
__pblk_end_io_read(pblk, rqd, true);
}
static int pblk_fill_partial_read_bio(struct pblk *pblk, struct nvm_rq *rqd,
unsigned int bio_init_idx,
unsigned long *read_bitmap)
{
struct bio *new_bio, *bio = rqd->bio;
struct pblk_sec_meta *meta_list = rqd->meta_list;
struct bio_vec src_bv, dst_bv;
void *ppa_ptr = NULL;
void *src_p, *dst_p;
dma_addr_t dma_ppa_list = 0;
__le64 *lba_list_mem, *lba_list_media;
int nr_secs = rqd->nr_ppas;
int nr_holes = nr_secs - bitmap_weight(read_bitmap, nr_secs);
int i, ret, hole;
DECLARE_COMPLETION_ONSTACK(wait);
/* Re-use allocated memory for intermediate lbas */
lba_list_mem = (((void *)rqd->ppa_list) + pblk_dma_ppa_size);
lba_list_media = (((void *)rqd->ppa_list) + 2 * pblk_dma_ppa_size);
new_bio = bio_alloc(GFP_KERNEL, nr_holes);
if (!new_bio) {
pr_err("pblk: could not alloc read bio\n");
return NVM_IO_ERR;
}
if (pblk_bio_add_pages(pblk, new_bio, GFP_KERNEL, nr_holes))
goto err;
@ -181,34 +222,29 @@ static int pblk_fill_partial_read_bio(struct pblk *pblk, struct nvm_rq *rqd,
goto err;
}
for (i = 0; i < nr_secs; i++)
lba_list_mem[i] = meta_list[i].lba;
new_bio->bi_iter.bi_sector = 0; /* internal bio */
bio_set_op_attrs(new_bio, REQ_OP_READ, 0);
new_bio->bi_private = &wait;
new_bio->bi_end_io = pblk_end_bio_sync;
rqd->bio = new_bio;
rqd->nr_ppas = nr_holes;
rqd->flags = pblk_set_read_mode(pblk, PBLK_READ_RANDOM);
rqd->end_io = NULL;
if (unlikely(nr_secs > 1 && nr_holes == 1)) {
if (unlikely(nr_holes == 1)) {
ppa_ptr = rqd->ppa_list;
dma_ppa_list = rqd->dma_ppa_list;
rqd->ppa_addr = rqd->ppa_list[0];
}
ret = pblk_submit_read_io(pblk, rqd);
ret = pblk_submit_io_sync(pblk, rqd);
if (ret) {
bio_put(rqd->bio);
pr_err("pblk: read IO submission failed\n");
pr_err("pblk: sync read IO submission failed\n");
goto err;
}
if (!wait_for_completion_io_timeout(&wait,
msecs_to_jiffies(PBLK_COMMAND_TIMEOUT_MS))) {
pr_err("pblk: partial read I/O timed out\n");
}
if (rqd->error) {
atomic_long_inc(&pblk->read_failed);
#ifdef CONFIG_NVM_DEBUG
@ -216,15 +252,31 @@ static int pblk_fill_partial_read_bio(struct pblk *pblk, struct nvm_rq *rqd,
#endif
}
if (unlikely(nr_secs > 1 && nr_holes == 1)) {
if (unlikely(nr_holes == 1)) {
struct ppa_addr ppa;
ppa = rqd->ppa_addr;
rqd->ppa_list = ppa_ptr;
rqd->dma_ppa_list = dma_ppa_list;
rqd->ppa_list[0] = ppa;
}
for (i = 0; i < nr_secs; i++) {
lba_list_media[i] = meta_list[i].lba;
meta_list[i].lba = lba_list_mem[i];
}
/* Fill the holes in the original bio */
i = 0;
hole = find_first_zero_bit(read_bitmap, nr_secs);
do {
int line_id = pblk_dev_ppa_to_line(rqd->ppa_list[i]);
struct pblk_line *line = &pblk->lines[line_id];
kref_put(&line->ref, pblk_line_put);
meta_list[hole].lba = lba_list_media[i];
src_bv = new_bio->bi_io_vec[i++];
dst_bv = bio->bi_io_vec[bio_init_idx + hole];
@ -238,7 +290,7 @@ static int pblk_fill_partial_read_bio(struct pblk *pblk, struct nvm_rq *rqd,
kunmap_atomic(src_p);
kunmap_atomic(dst_p);
mempool_free(src_bv.bv_page, pblk->page_pool);
mempool_free(src_bv.bv_page, pblk->page_bio_pool);
hole = find_next_zero_bit(read_bitmap, nr_secs, hole + 1);
} while (hole < nr_secs);
@ -246,34 +298,26 @@ static int pblk_fill_partial_read_bio(struct pblk *pblk, struct nvm_rq *rqd,
bio_put(new_bio);
/* Complete the original bio and associated request */
bio_endio(bio);
rqd->bio = bio;
rqd->nr_ppas = nr_secs;
rqd->private = pblk;
bio_endio(bio);
pblk_end_io_read(rqd);
__pblk_end_io_read(pblk, rqd, false);
return NVM_IO_OK;
err:
/* Free allocated pages in new bio */
pblk_bio_free_pages(pblk, bio, 0, new_bio->bi_vcnt);
rqd->private = pblk;
pblk_end_io_read(rqd);
__pblk_end_io_read(pblk, rqd, false);
return NVM_IO_ERR;
}
static void pblk_read_rq(struct pblk *pblk, struct nvm_rq *rqd,
unsigned long *read_bitmap)
sector_t lba, unsigned long *read_bitmap)
{
struct pblk_sec_meta *meta_list = rqd->meta_list;
struct bio *bio = rqd->bio;
struct ppa_addr ppa;
sector_t lba = pblk_get_lba(bio);
/* logic error: lba out-of-bounds. Ignore read request */
if (lba >= pblk->rl.nr_secs) {
WARN(1, "pblk: read lba out of bounds\n");
return;
}
pblk_lookup_l2p_seq(pblk, &ppa, lba, 1);
@ -284,6 +328,7 @@ static void pblk_read_rq(struct pblk *pblk, struct nvm_rq *rqd,
retry:
if (pblk_ppa_empty(ppa)) {
WARN_ON(test_and_set_bit(0, read_bitmap));
meta_list[0].lba = cpu_to_le64(ADDR_EMPTY);
return;
}
@ -295,9 +340,12 @@ retry:
pblk_lookup_l2p_seq(pblk, &ppa, lba, 1);
goto retry;
}
WARN_ON(test_and_set_bit(0, read_bitmap));
meta_list[0].lba = cpu_to_le64(lba);
#ifdef CONFIG_NVM_DEBUG
atomic_long_inc(&pblk->cache_reads);
atomic_long_inc(&pblk->cache_reads);
#endif
} else {
rqd->ppa_addr = ppa;
@ -309,22 +357,24 @@ retry:
int pblk_submit_read(struct pblk *pblk, struct bio *bio)
{
struct nvm_tgt_dev *dev = pblk->dev;
sector_t blba = pblk_get_lba(bio);
unsigned int nr_secs = pblk_get_secs(bio);
struct pblk_g_ctx *r_ctx;
struct nvm_rq *rqd;
unsigned long read_bitmap; /* Max 64 ppas per request */
unsigned int bio_init_idx;
unsigned long read_bitmap; /* Max 64 ppas per request */
int ret = NVM_IO_ERR;
if (nr_secs > PBLK_MAX_REQ_ADDRS)
/* logic error: lba out-of-bounds. Ignore read request */
if (blba >= pblk->rl.nr_secs || nr_secs > PBLK_MAX_REQ_ADDRS) {
WARN(1, "pblk: read lba out of bounds (lba:%llu, nr:%d)\n",
(unsigned long long)blba, nr_secs);
return NVM_IO_ERR;
}
bitmap_zero(&read_bitmap, nr_secs);
rqd = pblk_alloc_rqd(pblk, READ);
if (IS_ERR(rqd)) {
pr_err_ratelimited("pblk: not able to alloc rqd");
return NVM_IO_ERR;
}
rqd = pblk_alloc_rqd(pblk, PBLK_READ);
rqd->opcode = NVM_OP_PREAD;
rqd->bio = bio;
@ -332,6 +382,9 @@ int pblk_submit_read(struct pblk *pblk, struct bio *bio)
rqd->private = pblk;
rqd->end_io = pblk_end_io_read;
r_ctx = nvm_rq_to_pdu(rqd);
r_ctx->lba = blba;
/* Save the index for this bio's start. This is needed in case
* we need to fill a partial read.
*/
@ -348,23 +401,22 @@ int pblk_submit_read(struct pblk *pblk, struct bio *bio)
rqd->ppa_list = rqd->meta_list + pblk_dma_meta_size;
rqd->dma_ppa_list = rqd->dma_meta_list + pblk_dma_meta_size;
pblk_read_ppalist_rq(pblk, rqd, &read_bitmap);
pblk_read_ppalist_rq(pblk, rqd, blba, &read_bitmap);
} else {
pblk_read_rq(pblk, rqd, &read_bitmap);
pblk_read_rq(pblk, rqd, blba, &read_bitmap);
}
bio_get(bio);
if (bitmap_full(&read_bitmap, nr_secs)) {
bio_endio(bio);
atomic_inc(&pblk->inflight_io);
pblk_end_io_read(rqd);
__pblk_end_io_read(pblk, rqd, false);
return NVM_IO_OK;
}
/* All sectors are to be read from the device */
if (bitmap_empty(&read_bitmap, rqd->nr_ppas)) {
struct bio *int_bio = NULL;
struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
/* Clone read bio to deal with read errors internally */
int_bio = bio_clone_fast(bio, GFP_KERNEL, pblk_bio_set);
@ -399,40 +451,46 @@ int pblk_submit_read(struct pblk *pblk, struct bio *bio)
return NVM_IO_OK;
fail_rqd_free:
pblk_free_rqd(pblk, rqd, READ);
pblk_free_rqd(pblk, rqd, PBLK_READ);
return ret;
}
static int read_ppalist_rq_gc(struct pblk *pblk, struct nvm_rq *rqd,
struct pblk_line *line, u64 *lba_list,
unsigned int nr_secs)
u64 *paddr_list_gc, unsigned int nr_secs)
{
struct ppa_addr ppas[PBLK_MAX_REQ_ADDRS];
struct ppa_addr ppa_list_l2p[PBLK_MAX_REQ_ADDRS];
struct ppa_addr ppa_gc;
int valid_secs = 0;
int i;
pblk_lookup_l2p_rand(pblk, ppas, lba_list, nr_secs);
pblk_lookup_l2p_rand(pblk, ppa_list_l2p, lba_list, nr_secs);
for (i = 0; i < nr_secs; i++) {
if (pblk_addr_in_cache(ppas[i]) || ppas[i].g.blk != line->id ||
pblk_ppa_empty(ppas[i])) {
lba_list[i] = ADDR_EMPTY;
if (lba_list[i] == ADDR_EMPTY)
continue;
ppa_gc = addr_to_gen_ppa(pblk, paddr_list_gc[i], line->id);
if (!pblk_ppa_comp(ppa_list_l2p[i], ppa_gc)) {
paddr_list_gc[i] = lba_list[i] = ADDR_EMPTY;
continue;
}
rqd->ppa_list[valid_secs++] = ppas[i];
rqd->ppa_list[valid_secs++] = ppa_list_l2p[i];
}
#ifdef CONFIG_NVM_DEBUG
atomic_long_add(valid_secs, &pblk->inflight_reads);
#endif
return valid_secs;
}
static int read_rq_gc(struct pblk *pblk, struct nvm_rq *rqd,
struct pblk_line *line, sector_t lba)
struct pblk_line *line, sector_t lba,
u64 paddr_gc)
{
struct ppa_addr ppa;
struct ppa_addr ppa_l2p, ppa_gc;
int valid_secs = 0;
if (lba == ADDR_EMPTY)
@ -445,15 +503,14 @@ static int read_rq_gc(struct pblk *pblk, struct nvm_rq *rqd,
}
spin_lock(&pblk->trans_lock);
ppa = pblk_trans_map_get(pblk, lba);
ppa_l2p = pblk_trans_map_get(pblk, lba);
spin_unlock(&pblk->trans_lock);
/* Ignore updated values until the moment */
if (pblk_addr_in_cache(ppa) || ppa.g.blk != line->id ||
pblk_ppa_empty(ppa))
ppa_gc = addr_to_gen_ppa(pblk, paddr_gc, line->id);
if (!pblk_ppa_comp(ppa_l2p, ppa_gc))
goto out;
rqd->ppa_addr = ppa;
rqd->ppa_addr = ppa_l2p;
valid_secs = 1;
#ifdef CONFIG_NVM_DEBUG
@ -464,42 +521,44 @@ out:
return valid_secs;
}
int pblk_submit_read_gc(struct pblk *pblk, u64 *lba_list, void *data,
unsigned int nr_secs, unsigned int *secs_to_gc,
struct pblk_line *line)
int pblk_submit_read_gc(struct pblk *pblk, struct pblk_gc_rq *gc_rq)
{
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = &dev->geo;
struct bio *bio;
struct nvm_rq rqd;
int ret, data_len;
DECLARE_COMPLETION_ONSTACK(wait);
int data_len;
int ret = NVM_IO_OK;
memset(&rqd, 0, sizeof(struct nvm_rq));
rqd.meta_list = nvm_dev_dma_alloc(dev->parent, GFP_KERNEL,
&rqd.dma_meta_list);
if (!rqd.meta_list)
return NVM_IO_ERR;
return -ENOMEM;
if (nr_secs > 1) {
if (gc_rq->nr_secs > 1) {
rqd.ppa_list = rqd.meta_list + pblk_dma_meta_size;
rqd.dma_ppa_list = rqd.dma_meta_list + pblk_dma_meta_size;
*secs_to_gc = read_ppalist_rq_gc(pblk, &rqd, line, lba_list,
nr_secs);
if (*secs_to_gc == 1)
gc_rq->secs_to_gc = read_ppalist_rq_gc(pblk, &rqd, gc_rq->line,
gc_rq->lba_list,
gc_rq->paddr_list,
gc_rq->nr_secs);
if (gc_rq->secs_to_gc == 1)
rqd.ppa_addr = rqd.ppa_list[0];
} else {
*secs_to_gc = read_rq_gc(pblk, &rqd, line, lba_list[0]);
gc_rq->secs_to_gc = read_rq_gc(pblk, &rqd, gc_rq->line,
gc_rq->lba_list[0],
gc_rq->paddr_list[0]);
}
if (!(*secs_to_gc))
if (!(gc_rq->secs_to_gc))
goto out;
data_len = (*secs_to_gc) * geo->sec_size;
bio = pblk_bio_map_addr(pblk, data, *secs_to_gc, data_len,
PBLK_KMALLOC_META, GFP_KERNEL);
data_len = (gc_rq->secs_to_gc) * geo->sec_size;
bio = pblk_bio_map_addr(pblk, gc_rq->data, gc_rq->secs_to_gc, data_len,
PBLK_VMALLOC_META, GFP_KERNEL);
if (IS_ERR(bio)) {
pr_err("pblk: could not allocate GC bio (%lu)\n", PTR_ERR(bio));
goto err_free_dma;
@ -509,23 +568,16 @@ int pblk_submit_read_gc(struct pblk *pblk, u64 *lba_list, void *data,
bio_set_op_attrs(bio, REQ_OP_READ, 0);
rqd.opcode = NVM_OP_PREAD;
rqd.end_io = pblk_end_io_sync;
rqd.private = &wait;
rqd.nr_ppas = *secs_to_gc;
rqd.nr_ppas = gc_rq->secs_to_gc;
rqd.flags = pblk_set_read_mode(pblk, PBLK_READ_RANDOM);
rqd.bio = bio;
ret = pblk_submit_read_io(pblk, &rqd);
if (ret) {
bio_endio(bio);
if (pblk_submit_io_sync(pblk, &rqd)) {
ret = -EIO;
pr_err("pblk: GC read request failed\n");
goto err_free_dma;
goto err_free_bio;
}
if (!wait_for_completion_io_timeout(&wait,
msecs_to_jiffies(PBLK_COMMAND_TIMEOUT_MS))) {
pr_err("pblk: GC read I/O timed out\n");
}
atomic_dec(&pblk->inflight_io);
if (rqd.error) {
@ -536,16 +588,18 @@ int pblk_submit_read_gc(struct pblk *pblk, u64 *lba_list, void *data,
}
#ifdef CONFIG_NVM_DEBUG
atomic_long_add(*secs_to_gc, &pblk->sync_reads);
atomic_long_add(*secs_to_gc, &pblk->recov_gc_reads);
atomic_long_sub(*secs_to_gc, &pblk->inflight_reads);
atomic_long_add(gc_rq->secs_to_gc, &pblk->sync_reads);
atomic_long_add(gc_rq->secs_to_gc, &pblk->recov_gc_reads);
atomic_long_sub(gc_rq->secs_to_gc, &pblk->inflight_reads);
#endif
out:
nvm_dev_dma_free(dev->parent, rqd.meta_list, rqd.dma_meta_list);
return NVM_IO_OK;
return ret;
err_free_bio:
bio_put(bio);
err_free_dma:
nvm_dev_dma_free(dev->parent, rqd.meta_list, rqd.dma_meta_list);
return NVM_IO_ERR;
return ret;
}

@ -34,10 +34,6 @@ void pblk_submit_rec(struct work_struct *work)
max_secs);
bio = bio_alloc(GFP_KERNEL, nr_rec_secs);
if (!bio) {
pr_err("pblk: not able to create recovery bio\n");
return;
}
bio->bi_iter.bi_sector = 0;
bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
@ -71,7 +67,7 @@ void pblk_submit_rec(struct work_struct *work)
err:
bio_put(bio);
pblk_free_rqd(pblk, rqd, WRITE);
pblk_free_rqd(pblk, rqd, PBLK_WRITE);
}
int pblk_recov_setup_rq(struct pblk *pblk, struct pblk_c_ctx *c_ctx,
@ -84,12 +80,7 @@ int pblk_recov_setup_rq(struct pblk *pblk, struct pblk_c_ctx *c_ctx,
struct pblk_c_ctx *rec_ctx;
int nr_entries = c_ctx->nr_valid + c_ctx->nr_padded;
rec_rqd = pblk_alloc_rqd(pblk, WRITE);
if (IS_ERR(rec_rqd)) {
pr_err("pblk: could not create recovery req.\n");
return -ENOMEM;
}
rec_rqd = pblk_alloc_rqd(pblk, PBLK_WRITE);
rec_ctx = nvm_rq_to_pdu(rec_rqd);
/* Copy completion bitmap, but exclude the first X completed entries */
@ -142,19 +133,19 @@ static int pblk_recov_l2p_from_emeta(struct pblk *pblk, struct pblk_line *line)
struct pblk_emeta *emeta = line->emeta;
struct line_emeta *emeta_buf = emeta->buf;
__le64 *lba_list;
int data_start;
int nr_data_lbas, nr_valid_lbas, nr_lbas = 0;
int i;
u64 data_start, data_end;
u64 nr_valid_lbas, nr_lbas = 0;
u64 i;
lba_list = pblk_recov_get_lba_list(pblk, emeta_buf);
if (!lba_list)
return 1;
data_start = pblk_line_smeta_start(pblk, line) + lm->smeta_sec;
nr_data_lbas = lm->sec_per_line - lm->emeta_sec[0];
data_end = line->emeta_ssec;
nr_valid_lbas = le64_to_cpu(emeta_buf->nr_valid_lbas);
for (i = data_start; i < nr_data_lbas && nr_lbas < nr_valid_lbas; i++) {
for (i = data_start; i < data_end; i++) {
struct ppa_addr ppa;
int pos;
@ -181,8 +172,8 @@ static int pblk_recov_l2p_from_emeta(struct pblk *pblk, struct pblk_line *line)
}
if (nr_valid_lbas != nr_lbas)
pr_err("pblk: line %d - inconsistent lba list(%llu/%d)\n",
line->id, emeta_buf->nr_valid_lbas, nr_lbas);
pr_err("pblk: line %d - inconsistent lba list(%llu/%llu)\n",
line->id, nr_valid_lbas, nr_lbas);
line->left_msecs = 0;
@ -225,7 +216,6 @@ static int pblk_recov_read_oob(struct pblk *pblk, struct pblk_line *line,
int rq_ppas, rq_len;
int i, j;
int ret = 0;
DECLARE_COMPLETION_ONSTACK(wait);
ppa_list = p.ppa_list;
meta_list = p.meta_list;
@ -262,8 +252,6 @@ next_read_rq:
rqd->ppa_list = ppa_list;
rqd->dma_ppa_list = dma_ppa_list;
rqd->dma_meta_list = dma_meta_list;
rqd->end_io = pblk_end_io_sync;
rqd->private = &wait;
if (pblk_io_aligned(pblk, rq_ppas))
rqd->flags = pblk_set_read_mode(pblk, PBLK_READ_SEQUENTIAL);
@ -289,19 +277,13 @@ next_read_rq:
}
/* If read fails, more padding is needed */
ret = pblk_submit_io(pblk, rqd);
ret = pblk_submit_io_sync(pblk, rqd);
if (ret) {
pr_err("pblk: I/O submission failed: %d\n", ret);
return ret;
}
if (!wait_for_completion_io_timeout(&wait,
msecs_to_jiffies(PBLK_COMMAND_TIMEOUT_MS))) {
pr_err("pblk: L2P recovery read timed out\n");
return -EINTR;
}
atomic_dec(&pblk->inflight_io);
reinit_completion(&wait);
/* At this point, the read should not fail. If it does, it is a problem
* we cannot recover from here. Need FTL log.
@ -338,13 +320,10 @@ static void pblk_end_io_recov(struct nvm_rq *rqd)
{
struct pblk_pad_rq *pad_rq = rqd->private;
struct pblk *pblk = pad_rq->pblk;
struct nvm_tgt_dev *dev = pblk->dev;
pblk_up_page(pblk, rqd->ppa_list, rqd->nr_ppas);
bio_put(rqd->bio);
nvm_dev_dma_free(dev->parent, rqd->meta_list, rqd->dma_meta_list);
pblk_free_rqd(pblk, rqd, WRITE);
pblk_free_rqd(pblk, rqd, PBLK_WRITE_INT);
atomic_dec(&pblk->inflight_io);
kref_put(&pad_rq->ref, pblk_recov_complete);
@ -404,25 +383,21 @@ next_pad_rq:
ppa_list = (void *)(meta_list) + pblk_dma_meta_size;
dma_ppa_list = dma_meta_list + pblk_dma_meta_size;
rqd = pblk_alloc_rqd(pblk, WRITE);
if (IS_ERR(rqd)) {
ret = PTR_ERR(rqd);
goto fail_free_meta;
}
bio = pblk_bio_map_addr(pblk, data, rq_ppas, rq_len,
PBLK_VMALLOC_META, GFP_KERNEL);
if (IS_ERR(bio)) {
ret = PTR_ERR(bio);
goto fail_free_rqd;
goto fail_free_meta;
}
bio->bi_iter.bi_sector = 0; /* internal bio */
bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
rqd = pblk_alloc_rqd(pblk, PBLK_WRITE_INT);
rqd->bio = bio;
rqd->opcode = NVM_OP_PWRITE;
rqd->flags = pblk_set_progr_mode(pblk, WRITE);
rqd->flags = pblk_set_progr_mode(pblk, PBLK_WRITE);
rqd->meta_list = meta_list;
rqd->nr_ppas = rq_ppas;
rqd->ppa_list = ppa_list;
@ -490,8 +465,6 @@ free_rq:
fail_free_bio:
bio_put(bio);
fail_free_rqd:
pblk_free_rqd(pblk, rqd, WRITE);
fail_free_meta:
nvm_dev_dma_free(dev->parent, meta_list, dma_meta_list);
fail_free_pad:
@ -522,7 +495,6 @@ static int pblk_recov_scan_all_oob(struct pblk *pblk, struct pblk_line *line,
int ret = 0;
int rec_round;
int left_ppas = pblk_calc_sec_in_line(pblk, line) - line->cur_sec;
DECLARE_COMPLETION_ONSTACK(wait);
ppa_list = p.ppa_list;
meta_list = p.meta_list;
@ -557,8 +529,6 @@ next_rq:
rqd->ppa_list = ppa_list;
rqd->dma_ppa_list = dma_ppa_list;
rqd->dma_meta_list = dma_meta_list;
rqd->end_io = pblk_end_io_sync;
rqd->private = &wait;
if (pblk_io_aligned(pblk, rq_ppas))
rqd->flags = pblk_set_read_mode(pblk, PBLK_READ_SEQUENTIAL);
@ -584,18 +554,13 @@ next_rq:
addr_to_gen_ppa(pblk, w_ptr, line->id);
}
ret = pblk_submit_io(pblk, rqd);
ret = pblk_submit_io_sync(pblk, rqd);
if (ret) {
pr_err("pblk: I/O submission failed: %d\n", ret);
return ret;
}
if (!wait_for_completion_io_timeout(&wait,
msecs_to_jiffies(PBLK_COMMAND_TIMEOUT_MS))) {
pr_err("pblk: L2P recovery read timed out\n");
}
atomic_dec(&pblk->inflight_io);
reinit_completion(&wait);
/* This should not happen since the read failed during normal recovery,
* but the media works funny sometimes...
@ -663,7 +628,6 @@ static int pblk_recov_scan_oob(struct pblk *pblk, struct pblk_line *line,
int i, j;
int ret = 0;
int left_ppas = pblk_calc_sec_in_line(pblk, line);
DECLARE_COMPLETION_ONSTACK(wait);
ppa_list = p.ppa_list;
meta_list = p.meta_list;
@ -696,8 +660,6 @@ next_rq:
rqd->ppa_list = ppa_list;
rqd->dma_ppa_list = dma_ppa_list;
rqd->dma_meta_list = dma_meta_list;
rqd->end_io = pblk_end_io_sync;
rqd->private = &wait;
if (pblk_io_aligned(pblk, rq_ppas))
rqd->flags = pblk_set_read_mode(pblk, PBLK_READ_SEQUENTIAL);
@ -723,19 +685,14 @@ next_rq:
addr_to_gen_ppa(pblk, paddr, line->id);
}
ret = pblk_submit_io(pblk, rqd);
ret = pblk_submit_io_sync(pblk, rqd);
if (ret) {
pr_err("pblk: I/O submission failed: %d\n", ret);
bio_put(bio);
return ret;
}
if (!wait_for_completion_io_timeout(&wait,
msecs_to_jiffies(PBLK_COMMAND_TIMEOUT_MS))) {
pr_err("pblk: L2P recovery read timed out\n");
}
atomic_dec(&pblk->inflight_io);
reinit_completion(&wait);
/* Reached the end of the written line */
if (rqd->error) {
@ -785,15 +742,9 @@ static int pblk_recov_l2p_from_oob(struct pblk *pblk, struct pblk_line *line)
dma_addr_t dma_ppa_list, dma_meta_list;
int done, ret = 0;
rqd = pblk_alloc_rqd(pblk, READ);
if (IS_ERR(rqd))
return PTR_ERR(rqd);
meta_list = nvm_dev_dma_alloc(dev->parent, GFP_KERNEL, &dma_meta_list);
if (!meta_list) {
ret = -ENOMEM;
goto free_rqd;
}
if (!meta_list)
return -ENOMEM;
ppa_list = (void *)(meta_list) + pblk_dma_meta_size;
dma_ppa_list = dma_meta_list + pblk_dma_meta_size;
@ -804,6 +755,8 @@ static int pblk_recov_l2p_from_oob(struct pblk *pblk, struct pblk_line *line)
goto free_meta_list;
}
rqd = pblk_alloc_rqd(pblk, PBLK_READ);
p.ppa_list = ppa_list;
p.meta_list = meta_list;
p.rqd = rqd;
@ -832,8 +785,6 @@ out:
kfree(data);
free_meta_list:
nvm_dev_dma_free(dev->parent, meta_list, dma_meta_list);
free_rqd:
pblk_free_rqd(pblk, rqd, READ);
return ret;
}
@ -851,10 +802,32 @@ static void pblk_recov_line_add_ordered(struct list_head *head,
__list_add(&line->list, t->list.prev, &t->list);
}
struct pblk_line *pblk_recov_l2p(struct pblk *pblk)
static u64 pblk_line_emeta_start(struct pblk *pblk, struct pblk_line *line)
{
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = &dev->geo;
struct pblk_line_meta *lm = &pblk->lm;
unsigned int emeta_secs;
u64 emeta_start;
struct ppa_addr ppa;
int pos;
emeta_secs = lm->emeta_sec[0];
emeta_start = lm->sec_per_line;
while (emeta_secs) {
emeta_start--;
ppa = addr_to_pblk_ppa(pblk, emeta_start, line->id);
pos = pblk_ppa_to_pos(geo, ppa);
if (!test_bit(pos, line->blk_bitmap))
emeta_secs--;
}
return emeta_start;
}
struct pblk_line *pblk_recov_l2p(struct pblk *pblk)
{
struct pblk_line_meta *lm = &pblk->lm;
struct pblk_line_mgmt *l_mg = &pblk->l_mg;
struct pblk_line *line, *tline, *data_line = NULL;
@ -900,9 +873,9 @@ struct pblk_line *pblk_recov_l2p(struct pblk *pblk)
if (le32_to_cpu(smeta_buf->header.identifier) != PBLK_MAGIC)
continue;
if (le16_to_cpu(smeta_buf->header.version) != 1) {
if (smeta_buf->header.version != SMETA_VERSION) {
pr_err("pblk: found incompatible line version %u\n",
smeta_buf->header.version);
le16_to_cpu(smeta_buf->header.version));
return ERR_PTR(-EINVAL);
}
@ -954,15 +927,9 @@ struct pblk_line *pblk_recov_l2p(struct pblk *pblk)
/* Verify closed blocks and recover this portion of L2P table*/
list_for_each_entry_safe(line, tline, &recov_list, list) {
int off, nr_bb;
recovered_lines++;
/* Calculate where emeta starts based on the line bb */
off = lm->sec_per_line - lm->emeta_sec[0];
nr_bb = bitmap_weight(line->blk_bitmap, lm->blk_per_line);
off -= nr_bb * geo->sec_per_pl;
line->emeta_ssec = off;
line->emeta_ssec = pblk_line_emeta_start(pblk, line);
line->emeta = emeta;
memset(line->emeta->buf, 0, lm->emeta_len[0]);
@ -987,7 +954,7 @@ next:
list_move_tail(&line->list, move_list);
spin_unlock(&l_mg->gc_lock);
mempool_free(line->map_bitmap, pblk->line_meta_pool);
kfree(line->map_bitmap);
line->map_bitmap = NULL;
line->smeta = NULL;
line->emeta = NULL;
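
Note on the pblk_line_emeta_start() hunk above: the new helper walks backwards from the end of the line and counts only sectors that do not fall on a bad block. The following is a minimal user-space sketch of that walk, not the kernel code. The struct line_model type, the sector_to_blk() mapping, and the loop guard on start are assumptions made for illustration; the kernel resolves the position through addr_to_pblk_ppa() and pblk_ppa_to_pos().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of a pblk line (illustration only). */
struct line_model {
	uint64_t sec_per_line;     /* total sectors in the line */
	unsigned int emeta_secs;   /* sectors needed for end-of-line metadata */
	unsigned int sec_per_pl;   /* consecutive sectors placed on one block */
	unsigned int blk_per_line; /* blocks striped across the line */
	const bool *bad_blk;       /* bad_blk[i] is true when block i is bad */
};

/* Assumed striping: sec_per_pl sectors per block, round-robin over blocks. */
static unsigned int sector_to_blk(const struct line_model *l, uint64_t paddr)
{
	return (unsigned int)((paddr / l->sec_per_pl) % l->blk_per_line);
}

/*
 * Walk back from the end of the line until emeta_secs usable sectors
 * have been seen; sectors on bad blocks are skipped, not counted.
 * The "start" check is a toy guard against running off the line.
 */
static uint64_t emeta_start(const struct line_model *l)
{
	unsigned int left = l->emeta_secs;
	uint64_t start = l->sec_per_line;

	while (left && start) {
		start--;
		if (!l->bad_blk[sector_to_blk(l, start)])
			left--;
	}

	return start;
}
```

In this toy layout (16 sectors, 2 per block, 4 blocks, 4 metadata sectors) the start is sector 12 with no bad blocks and sector 10 when the last block is bad. A bad block whose sectors all lie before the metadata region leaves the start at 12, whereas the removed computation (off -= nr_bb * geo->sec_per_pl) would subtract for it regardless of position.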

@ -96,9 +96,11 @@ unsigned long pblk_rl_nr_free_blks(struct pblk_rl *rl)
*
* Only the total number of free blocks is used to configure the rate limiter.
*/
static int pblk_rl_update_rates(struct pblk_rl *rl, unsigned long max)
void pblk_rl_update_rates(struct pblk_rl *rl)
{
struct pblk *pblk = container_of(rl, struct pblk, rl);
unsigned long free_blocks = pblk_rl_nr_free_blks(rl);
int max = rl->rb_budget;
if (free_blocks >= rl->high) {
rl->rb_user_max = max;
@ -124,23 +126,18 @@ static int pblk_rl_update_rates(struct pblk_rl *rl, unsigned long max)
rl->rb_state = PBLK_RL_LOW;
}
return rl->rb_state;
if (rl->rb_state == (PBLK_RL_MID | PBLK_RL_LOW))
pblk_gc_should_start(pblk);
else
pblk_gc_should_stop(pblk);
}
void pblk_rl_free_lines_inc(struct pblk_rl *rl, struct pblk_line *line)
{
struct pblk *pblk = container_of(rl, struct pblk, rl);
int blk_in_line = atomic_read(&line->blk_in_line);
int ret;
atomic_add(blk_in_line, &rl->free_blocks);
/* Rates will not change that often - no need to lock update */
ret = pblk_rl_update_rates(rl, rl->rb_budget);
if (ret == (PBLK_RL_MID | PBLK_RL_LOW))
pblk_gc_should_start(pblk);
else
pblk_gc_should_stop(pblk);
pblk_rl_update_rates(rl);
}
void pblk_rl_free_lines_dec(struct pblk_rl *rl, struct pblk_line *line)
@ -148,19 +145,7 @@ void pblk_rl_free_lines_dec(struct pblk_rl *rl, struct pblk_line *line)
int blk_in_line = atomic_read(&line->blk_in_line);
atomic_sub(blk_in_line, &rl->free_blocks);
}
void pblk_gc_should_kick(struct pblk *pblk)
{
struct pblk_rl *rl = &pblk->rl;
int ret;
/* Rates will not change that often - no need to lock update */
ret = pblk_rl_update_rates(rl, rl->rb_budget);
if (ret == (PBLK_RL_MID | PBLK_RL_LOW))
pblk_gc_should_start(pblk);
else
pblk_gc_should_stop(pblk);
pblk_rl_update_rates(rl);
}
int pblk_rl_high_thrs(struct pblk_rl *rl)
@ -168,14 +153,9 @@ int pblk_rl_high_thrs(struct pblk_rl *rl)
return rl->high;
}
int pblk_rl_low_thrs(struct pblk_rl *rl)
int pblk_rl_max_io(struct pblk_rl *rl)
{
return rl->low;
}
int pblk_rl_sysfs_rate_show(struct pblk_rl *rl)
{
return rl->rb_user_max;
return rl->rb_max_io;
}
static void pblk_rl_u_timer(unsigned long data)
@ -214,6 +194,7 @@ void pblk_rl_init(struct pblk_rl *rl, int budget)
/* To start with, all buffer is available to user I/O writers */
rl->rb_budget = budget;
rl->rb_user_max = budget;
rl->rb_max_io = budget >> 1;
rl->rb_gc_max = 0;
rl->rb_state = PBLK_RL_HIGH;

@ -253,7 +253,7 @@ static ssize_t pblk_sysfs_lines(struct pblk *pblk, char *page)
sz += snprintf(page + sz, PAGE_SIZE - sz,
"GC: full:%d, high:%d, mid:%d, low:%d, empty:%d, queue:%d\n",
gc_full, gc_high, gc_mid, gc_low, gc_empty,
atomic_read(&pblk->gc.inflight_gc));
atomic_read(&pblk->gc.read_inflight_gc));
sz += snprintf(page + sz, PAGE_SIZE - sz,
"data (%d) cur:%d, left:%d, vsc:%d, s:%d, map:%d/%d (%d)\n",

@ -20,7 +20,6 @@
static unsigned long pblk_end_w_bio(struct pblk *pblk, struct nvm_rq *rqd,
struct pblk_c_ctx *c_ctx)
{
struct nvm_tgt_dev *dev = pblk->dev;
struct bio *original_bio;
unsigned long ret;
int i;
@ -33,16 +32,18 @@ static unsigned long pblk_end_w_bio(struct pblk *pblk, struct nvm_rq *rqd,
bio_endio(original_bio);
}
if (c_ctx->nr_padded)
pblk_bio_free_pages(pblk, rqd->bio, c_ctx->nr_valid,
c_ctx->nr_padded);
#ifdef CONFIG_NVM_DEBUG
atomic_long_add(c_ctx->nr_valid, &pblk->sync_writes);
atomic_long_add(rqd->nr_ppas, &pblk->sync_writes);
#endif
ret = pblk_rb_sync_advance(&pblk->rwb, c_ctx->nr_valid);
nvm_dev_dma_free(dev->parent, rqd->meta_list, rqd->dma_meta_list);
bio_put(rqd->bio);
pblk_free_rqd(pblk, rqd, WRITE);
pblk_free_rqd(pblk, rqd, PBLK_WRITE);
return ret;
}
@ -107,10 +108,7 @@ static void pblk_end_w_fail(struct pblk *pblk, struct nvm_rq *rqd)
ppa_list = &rqd->ppa_addr;
recovery = mempool_alloc(pblk->rec_pool, GFP_ATOMIC);
if (!recovery) {
pr_err("pblk: could not allocate recovery context\n");
return;
}
INIT_LIST_HEAD(&recovery->failed);
bit = -1;
@ -175,7 +173,6 @@ static void pblk_end_io_write(struct nvm_rq *rqd)
static void pblk_end_io_write_meta(struct nvm_rq *rqd)
{
struct pblk *pblk = rqd->private;
struct nvm_tgt_dev *dev = pblk->dev;
struct pblk_g_ctx *m_ctx = nvm_rq_to_pdu(rqd);
struct pblk_line *line = m_ctx->private;
struct pblk_emeta *emeta = line->emeta;
@ -187,19 +184,13 @@ static void pblk_end_io_write_meta(struct nvm_rq *rqd)
pblk_log_write_err(pblk, rqd);
pr_err("pblk: metadata I/O failed. Line %d\n", line->id);
}
#ifdef CONFIG_NVM_DEBUG
else
WARN_ONCE(rqd->bio->bi_status, "pblk: corrupted write error\n");
#endif
sync = atomic_add_return(rqd->nr_ppas, &emeta->sync);
if (sync == emeta->nr_entries)
pblk_line_run_ws(pblk, line, NULL, pblk_line_close_ws,
pblk->close_wq);
pblk_gen_run_ws(pblk, line, NULL, pblk_line_close_ws,
GFP_ATOMIC, pblk->close_wq);
bio_put(rqd->bio);
nvm_dev_dma_free(dev->parent, rqd->meta_list, rqd->dma_meta_list);
pblk_free_rqd(pblk, rqd, READ);
pblk_free_rqd(pblk, rqd, PBLK_WRITE_INT);
atomic_dec(&pblk->inflight_io);
}
@ -213,7 +204,7 @@ static int pblk_alloc_w_rq(struct pblk *pblk, struct nvm_rq *rqd,
/* Setup write request */
rqd->opcode = NVM_OP_PWRITE;
rqd->nr_ppas = nr_secs;
rqd->flags = pblk_set_progr_mode(pblk, WRITE);
rqd->flags = pblk_set_progr_mode(pblk, PBLK_WRITE);
rqd->private = pblk;
rqd->end_io = end_io;
@ -229,15 +220,16 @@ static int pblk_alloc_w_rq(struct pblk *pblk, struct nvm_rq *rqd,
}
static int pblk_setup_w_rq(struct pblk *pblk, struct nvm_rq *rqd,
struct pblk_c_ctx *c_ctx, struct ppa_addr *erase_ppa)
struct ppa_addr *erase_ppa)
{
struct pblk_line_meta *lm = &pblk->lm;
struct pblk_line *e_line = pblk_line_get_erase(pblk);
struct pblk_c_ctx *c_ctx = nvm_rq_to_pdu(rqd);
unsigned int valid = c_ctx->nr_valid;
unsigned int padded = c_ctx->nr_padded;
unsigned int nr_secs = valid + padded;
unsigned long *lun_bitmap;
int ret = 0;
int ret;
lun_bitmap = kzalloc(lm->lun_bitmap_len, GFP_KERNEL);
if (!lun_bitmap)
@ -279,7 +271,7 @@ int pblk_setup_w_rec_rq(struct pblk *pblk, struct nvm_rq *rqd,
pblk_map_rq(pblk, rqd, c_ctx->sentry, lun_bitmap, c_ctx->nr_valid, 0);
rqd->ppa_status = (u64)0;
rqd->flags = pblk_set_progr_mode(pblk, WRITE);
rqd->flags = pblk_set_progr_mode(pblk, PBLK_WRITE);
return ret;
}
@ -303,55 +295,6 @@ static int pblk_calc_secs_to_sync(struct pblk *pblk, unsigned int secs_avail,
return secs_to_sync;
}
static inline int pblk_valid_meta_ppa(struct pblk *pblk,
struct pblk_line *meta_line,
struct ppa_addr *ppa_list, int nr_ppas)
{
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = &dev->geo;
struct pblk_line *data_line;
struct ppa_addr ppa, ppa_opt;
u64 paddr;
int i;
data_line = &pblk->lines[pblk_dev_ppa_to_line(ppa_list[0])];
paddr = pblk_lookup_page(pblk, meta_line);
ppa = addr_to_gen_ppa(pblk, paddr, 0);
if (test_bit(pblk_ppa_to_pos(geo, ppa), data_line->blk_bitmap))
return 1;
/* Schedule a metadata I/O that is half the distance from the data I/O
* with regards to the number of LUNs forming the pblk instance. This
* balances LUN conflicts across every I/O.
*
* When the LUN configuration changes (e.g., due to GC), this distance
* can align, which would result in a LUN deadlock. In this case, modify
* the distance to not be optimal, but allow metadata I/Os to succeed.
*/
ppa_opt = addr_to_gen_ppa(pblk, paddr + data_line->meta_distance, 0);
if (unlikely(ppa_opt.ppa == ppa.ppa)) {
data_line->meta_distance--;
return 0;
}
for (i = 0; i < nr_ppas; i += pblk->min_write_pgs)
if (ppa_list[i].g.ch == ppa_opt.g.ch &&
ppa_list[i].g.lun == ppa_opt.g.lun)
return 1;
if (test_bit(pblk_ppa_to_pos(geo, ppa_opt), data_line->blk_bitmap)) {
for (i = 0; i < nr_ppas; i += pblk->min_write_pgs)
if (ppa_list[i].g.ch == ppa.g.ch &&
ppa_list[i].g.lun == ppa.g.lun)
return 0;
return 1;
}
return 0;
}
int pblk_submit_meta_io(struct pblk *pblk, struct pblk_line *meta_line)
{
struct nvm_tgt_dev *dev = pblk->dev;
@ -370,11 +313,8 @@ int pblk_submit_meta_io(struct pblk *pblk, struct pblk_line *meta_line)
int i, j;
int ret;
rqd = pblk_alloc_rqd(pblk, READ);
if (IS_ERR(rqd)) {
pr_err("pblk: cannot allocate write req.\n");
return PTR_ERR(rqd);
}
rqd = pblk_alloc_rqd(pblk, PBLK_WRITE_INT);
m_ctx = nvm_rq_to_pdu(rqd);
m_ctx->private = meta_line;
@ -407,8 +347,6 @@ int pblk_submit_meta_io(struct pblk *pblk, struct pblk_line *meta_line)
if (emeta->mem >= lm->emeta_len[0]) {
spin_lock(&l_mg->close_lock);
list_del(&meta_line->list);
WARN(!bitmap_full(meta_line->map_bitmap, lm->sec_per_line),
"pblk: corrupt meta line %d\n", meta_line->id);
spin_unlock(&l_mg->close_lock);
}
@ -428,18 +366,51 @@ fail_rollback:
pblk_dealloc_page(pblk, meta_line, rq_ppas);
list_add(&meta_line->list, &meta_line->list);
spin_unlock(&l_mg->close_lock);
nvm_dev_dma_free(dev->parent, rqd->meta_list, rqd->dma_meta_list);
fail_free_bio:
if (likely(l_mg->emeta_alloc_type == PBLK_VMALLOC_META))
bio_put(bio);
bio_put(bio);
fail_free_rqd:
pblk_free_rqd(pblk, rqd, READ);
pblk_free_rqd(pblk, rqd, PBLK_WRITE_INT);
return ret;
}
static int pblk_sched_meta_io(struct pblk *pblk, struct ppa_addr *prev_list,
int prev_n)
static inline bool pblk_valid_meta_ppa(struct pblk *pblk,
struct pblk_line *meta_line,
struct nvm_rq *data_rqd)
{
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = &dev->geo;
struct pblk_c_ctx *data_c_ctx = nvm_rq_to_pdu(data_rqd);
struct pblk_line *data_line = pblk_line_get_data(pblk);
struct ppa_addr ppa, ppa_opt;
u64 paddr;
int pos_opt;
/* Schedule a metadata I/O that is half the distance from the data I/O
* with regards to the number of LUNs forming the pblk instance. This
* balances LUN conflicts across every I/O.
*
* When the LUN configuration changes (e.g., due to GC), this distance
* can align, which would result in metadata and data I/Os colliding. In
* this case, modify the distance to not be optimal, but move the
* optimal in the right direction.
*/
paddr = pblk_lookup_page(pblk, meta_line);
ppa = addr_to_gen_ppa(pblk, paddr, 0);
ppa_opt = addr_to_gen_ppa(pblk, paddr + data_line->meta_distance, 0);
pos_opt = pblk_ppa_to_pos(geo, ppa_opt);
if (test_bit(pos_opt, data_c_ctx->lun_bitmap) ||
test_bit(pos_opt, data_line->blk_bitmap))
return true;
if (unlikely(pblk_ppa_comp(ppa_opt, ppa)))
data_line->meta_distance--;
return false;
}
static struct pblk_line *pblk_should_submit_meta_io(struct pblk *pblk,
struct nvm_rq *data_rqd)
{
struct pblk_line_meta *lm = &pblk->lm;
struct pblk_line_mgmt *l_mg = &pblk->l_mg;
@ -449,57 +420,45 @@ static int pblk_sched_meta_io(struct pblk *pblk, struct ppa_addr *prev_list,
retry:
if (list_empty(&l_mg->emeta_list)) {
spin_unlock(&l_mg->close_lock);
return 0;
return NULL;
}
meta_line = list_first_entry(&l_mg->emeta_list, struct pblk_line, list);
if (bitmap_full(meta_line->map_bitmap, lm->sec_per_line))
if (meta_line->emeta->mem >= lm->emeta_len[0])
goto retry;
spin_unlock(&l_mg->close_lock);
if (!pblk_valid_meta_ppa(pblk, meta_line, prev_list, prev_n))
return 0;
if (!pblk_valid_meta_ppa(pblk, meta_line, data_rqd))
return NULL;
return pblk_submit_meta_io(pblk, meta_line);
return meta_line;
}
static int pblk_submit_io_set(struct pblk *pblk, struct nvm_rq *rqd)
{
struct pblk_c_ctx *c_ctx = nvm_rq_to_pdu(rqd);
struct ppa_addr erase_ppa;
struct pblk_line *meta_line;
int err;
ppa_set_empty(&erase_ppa);
/* Assign lbas to ppas and populate request structure */
err = pblk_setup_w_rq(pblk, rqd, c_ctx, &erase_ppa);
err = pblk_setup_w_rq(pblk, rqd, &erase_ppa);
if (err) {
pr_err("pblk: could not setup write request: %d\n", err);
return NVM_IO_ERR;
}
if (likely(ppa_empty(erase_ppa))) {
/* Submit metadata write for previous data line */
err = pblk_sched_meta_io(pblk, rqd->ppa_list, rqd->nr_ppas);
if (err) {
pr_err("pblk: metadata I/O submission failed: %d", err);
return NVM_IO_ERR;
}
meta_line = pblk_should_submit_meta_io(pblk, rqd);
/* Submit data write for current data line */
err = pblk_submit_io(pblk, rqd);
if (err) {
pr_err("pblk: data I/O submission failed: %d\n", err);
return NVM_IO_ERR;
}
} else {
/* Submit data write for current data line */
err = pblk_submit_io(pblk, rqd);
if (err) {
pr_err("pblk: data I/O submission failed: %d\n", err);
return NVM_IO_ERR;
}
/* Submit data write for current data line */
err = pblk_submit_io(pblk, rqd);
if (err) {
pr_err("pblk: data I/O submission failed: %d\n", err);
return NVM_IO_ERR;
}
/* Submit available erase for next data line */
if (!ppa_empty(erase_ppa)) {
/* Submit erase for next data line */
if (pblk_blk_erase_async(pblk, erase_ppa)) {
struct pblk_line *e_line = pblk_line_get_erase(pblk);
struct nvm_tgt_dev *dev = pblk->dev;
@ -512,6 +471,15 @@ static int pblk_submit_io_set(struct pblk *pblk, struct nvm_rq *rqd)
}
}
if (meta_line) {
/* Submit metadata write for previous data line */
err = pblk_submit_meta_io(pblk, meta_line);
if (err) {
pr_err("pblk: metadata I/O submission failed: %d", err);
return NVM_IO_ERR;
}
}
return NVM_IO_OK;
}
@ -521,7 +489,8 @@ static void pblk_free_write_rqd(struct pblk *pblk, struct nvm_rq *rqd)
struct bio *bio = rqd->bio;
if (c_ctx->nr_padded)
pblk_bio_free_pages(pblk, bio, rqd->nr_ppas, c_ctx->nr_padded);
pblk_bio_free_pages(pblk, bio, c_ctx->nr_valid,
c_ctx->nr_padded);
}
static int pblk_submit_write(struct pblk *pblk)
@ -543,31 +512,24 @@ static int pblk_submit_write(struct pblk *pblk)
if (!secs_to_flush && secs_avail < pblk->min_write_pgs)
return 1;
rqd = pblk_alloc_rqd(pblk, WRITE);
if (IS_ERR(rqd)) {
pr_err("pblk: cannot allocate write req.\n");
return 1;
}
bio = bio_alloc(GFP_KERNEL, pblk->max_write_pgs);
if (!bio) {
pr_err("pblk: cannot allocate write bio\n");
goto fail_free_rqd;
}
bio->bi_iter.bi_sector = 0; /* internal bio */
bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
rqd->bio = bio;
secs_to_sync = pblk_calc_secs_to_sync(pblk, secs_avail, secs_to_flush);
if (secs_to_sync > pblk->max_write_pgs) {
pr_err("pblk: bad buffer sync calculation\n");
goto fail_put_bio;
return 1;
}
secs_to_com = (secs_to_sync > secs_avail) ? secs_avail : secs_to_sync;
pos = pblk_rb_read_commit(&pblk->rwb, secs_to_com);
if (pblk_rb_read_to_bio(&pblk->rwb, rqd, bio, pos, secs_to_sync,
bio = bio_alloc(GFP_KERNEL, secs_to_sync);
bio->bi_iter.bi_sector = 0; /* internal bio */
bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
rqd = pblk_alloc_rqd(pblk, PBLK_WRITE);
rqd->bio = bio;
if (pblk_rb_read_to_bio(&pblk->rwb, rqd, pos, secs_to_sync,
secs_avail)) {
pr_err("pblk: corrupted write bio\n");
goto fail_put_bio;
@ -586,8 +548,7 @@ fail_free_bio:
pblk_free_write_rqd(pblk, rqd);
fail_put_bio:
bio_put(bio);
fail_free_rqd:
pblk_free_rqd(pblk, rqd, WRITE);
pblk_free_rqd(pblk, rqd, PBLK_WRITE);
return 1;
}
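
Note on the reworked pblk_valid_meta_ppa() above: the check now tests a single position, the metadata address plus meta_distance, against the data request's LUN bitmap and the data line's bad-block bitmap. A rough user-space sketch of that decision follows. The struct meta_sched type, the 64-bit bitmaps, the modulo mapping from address to LUN position, and the guard on meta_distance are assumptions for illustration; the kernel uses addr_to_gen_ppa(), pblk_ppa_to_pos(), test_bit(), and pblk_ppa_comp().

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the kernel bitmaps: one bit per LUN, at most 64 LUNs. */
static bool bit_is_set(uint64_t map, unsigned int pos)
{
	return (map >> pos) & 1u;
}

/* Hypothetical scheduling state (illustration only). */
struct meta_sched {
	unsigned int nr_luns;       /* LUNs forming the instance */
	unsigned int meta_distance; /* distance kept between data and metadata */
};

/*
 * Returns true when the metadata write should be submitted alongside
 * this data write: the "optimal" position is either targeted by the
 * data request or sits on a bad block. Otherwise the metadata write is
 * held back, and the distance shrinks if it has aligned with the
 * metadata position itself.
 */
static bool meta_io_is_due(struct meta_sched *s, unsigned int meta_pos,
			   uint64_t data_luns, uint64_t bad_blks)
{
	unsigned int pos_opt = (meta_pos + s->meta_distance) % s->nr_luns;

	if (bit_is_set(data_luns, pos_opt) || bit_is_set(bad_blks, pos_opt))
		return true;

	/* Toy guard: never let the distance wrap below zero. */
	if (pos_opt == meta_pos && s->meta_distance > 0)
		s->meta_distance--;

	return false;
}
```

With 8 LUNs, a distance of 4, and the metadata at position 1, the write is due only when the data request touches position 5 or position 5 is a bad block. With a distance of 8 the optimal position wraps onto the metadata position, so the call returns false and the distance drops to 7.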

@ -40,10 +40,6 @@
#define PBLK_MAX_REQ_ADDRS (64)
#define PBLK_MAX_REQ_ADDRS_PW (6)
#define PBLK_WS_POOL_SIZE (128)
#define PBLK_META_POOL_SIZE (128)
#define PBLK_READ_REQ_POOL_SIZE (1024)
#define PBLK_NR_CLOSE_JOBS (4)
#define PBLK_CACHE_NAME_LEN (DISK_NAME_LEN + 16)
@ -59,7 +55,15 @@
for ((i) = 0, rlun = &(pblk)->luns[0]; \
(i) < (pblk)->nr_luns; (i)++, rlun = &(pblk)->luns[(i)])
#define ERASE 2 /* READ = 0, WRITE = 1 */
/* Static pool sizes */
#define PBLK_GEN_WS_POOL_SIZE (2)
enum {
PBLK_READ = READ,
PBLK_WRITE = WRITE,/* Write from write buffer */
PBLK_WRITE_INT, /* Internal write - no write buffer */
PBLK_ERASE,
};
enum {
/* IO Types */
@ -95,6 +99,7 @@ enum {
};
#define pblk_dma_meta_size (sizeof(struct pblk_sec_meta) * PBLK_MAX_REQ_ADDRS)
#define pblk_dma_ppa_size (sizeof(u64) * PBLK_MAX_REQ_ADDRS)
/* write buffer completion context */
struct pblk_c_ctx {
@ -106,9 +111,10 @@ struct pblk_c_ctx {
unsigned int nr_padded;
};
/* generic context */
/* read context */
struct pblk_g_ctx {
void *private;
u64 lba;
};
/* Pad context */
@ -207,6 +213,7 @@ struct pblk_lun {
struct pblk_gc_rq {
struct pblk_line *line;
void *data;
u64 paddr_list[PBLK_MAX_REQ_ADDRS];
u64 lba_list[PBLK_MAX_REQ_ADDRS];
int nr_secs;
int secs_to_gc;
@ -231,7 +238,10 @@ struct pblk_gc {
struct timer_list gc_timer;
struct semaphore gc_sem;
atomic_t inflight_gc;
atomic_t read_inflight_gc; /* Number of lines with inflight GC reads */
atomic_t pipeline_gc; /* Number of lines in the GC pipeline -
* started reads to finished writes
*/
int w_entries;
struct list_head w_list;
@ -267,6 +277,7 @@ struct pblk_rl {
int rb_gc_max; /* Max buffer entries available for GC I/O */
int rb_gc_rsv; /* Reserved buffer entries for GC I/O */
int rb_state; /* Rate-limiter current state */
int rb_max_io; /* Maximum size for an I/O given the config */
atomic_t rb_user_cnt; /* User I/O buffer counter */
atomic_t rb_gc_cnt; /* GC I/O buffer counter */
@ -310,6 +321,7 @@ enum {
};
#define PBLK_MAGIC 0x70626c6b /*pblk*/
#define SMETA_VERSION cpu_to_le16(1)
struct line_header {
__le32 crc;
@ -618,15 +630,16 @@ struct pblk {
struct list_head compl_list;
mempool_t *page_pool;
mempool_t *line_ws_pool;
mempool_t *page_bio_pool;
mempool_t *gen_ws_pool;
mempool_t *rec_pool;
mempool_t *g_rq_pool;
mempool_t *r_rq_pool;
mempool_t *w_rq_pool;
mempool_t *line_meta_pool;
mempool_t *e_rq_pool;
struct workqueue_struct *close_wq;
struct workqueue_struct *bb_wq;
struct workqueue_struct *r_end_wq;
struct timer_list wtimer;
@ -657,15 +670,15 @@ int pblk_rb_may_write_gc(struct pblk_rb *rb, unsigned int nr_entries,
void pblk_rb_write_entry_user(struct pblk_rb *rb, void *data,
struct pblk_w_ctx w_ctx, unsigned int pos);
void pblk_rb_write_entry_gc(struct pblk_rb *rb, void *data,
struct pblk_w_ctx w_ctx, struct pblk_line *gc_line,
unsigned int pos);
struct pblk_w_ctx w_ctx, struct pblk_line *line,
u64 paddr, unsigned int pos);
struct pblk_w_ctx *pblk_rb_w_ctx(struct pblk_rb *rb, unsigned int pos);
void pblk_rb_flush(struct pblk_rb *rb);
void pblk_rb_sync_l2p(struct pblk_rb *rb);
unsigned int pblk_rb_read_to_bio(struct pblk_rb *rb, struct nvm_rq *rqd,
struct bio *bio, unsigned int pos,
unsigned int nr_entries, unsigned int count);
unsigned int pos, unsigned int nr_entries,
unsigned int count);
unsigned int pblk_rb_read_to_bio_list(struct pblk_rb *rb, struct bio *bio,
struct list_head *list,
unsigned int max);
@ -692,24 +705,23 @@ ssize_t pblk_rb_sysfs(struct pblk_rb *rb, char *buf);
/*
* pblk core
*/
struct nvm_rq *pblk_alloc_rqd(struct pblk *pblk, int rw);
struct nvm_rq *pblk_alloc_rqd(struct pblk *pblk, int type);
void pblk_free_rqd(struct pblk *pblk, struct nvm_rq *rqd, int type);
void pblk_set_sec_per_write(struct pblk *pblk, int sec_per_write);
int pblk_setup_w_rec_rq(struct pblk *pblk, struct nvm_rq *rqd,
struct pblk_c_ctx *c_ctx);
void pblk_free_rqd(struct pblk *pblk, struct nvm_rq *rqd, int rw);
void pblk_wait_for_meta(struct pblk *pblk);
struct ppa_addr pblk_get_lba_map(struct pblk *pblk, sector_t lba);
void pblk_discard(struct pblk *pblk, struct bio *bio);
void pblk_log_write_err(struct pblk *pblk, struct nvm_rq *rqd);
void pblk_log_read_err(struct pblk *pblk, struct nvm_rq *rqd);
int pblk_submit_io(struct pblk *pblk, struct nvm_rq *rqd);
int pblk_submit_io_sync(struct pblk *pblk, struct nvm_rq *rqd);
int pblk_submit_meta_io(struct pblk *pblk, struct pblk_line *meta_line);
struct bio *pblk_bio_map_addr(struct pblk *pblk, void *data,
unsigned int nr_secs, unsigned int len,
int alloc_type, gfp_t gfp_mask);
struct pblk_line *pblk_line_get(struct pblk *pblk);
struct pblk_line *pblk_line_get_first_data(struct pblk *pblk);
void pblk_line_replace_data(struct pblk *pblk);
struct pblk_line *pblk_line_replace_data(struct pblk *pblk);
int pblk_line_recov_alloc(struct pblk *pblk, struct pblk_line *line);
void pblk_line_recov_close(struct pblk *pblk, struct pblk_line *line);
struct pblk_line *pblk_line_get_data(struct pblk *pblk);
@ -719,19 +731,18 @@ int pblk_line_is_full(struct pblk_line *line);
void pblk_line_free(struct pblk *pblk, struct pblk_line *line);
void pblk_line_close_meta(struct pblk *pblk, struct pblk_line *line);
void pblk_line_close(struct pblk *pblk, struct pblk_line *line);
void pblk_line_close_meta_sync(struct pblk *pblk);
void pblk_line_close_ws(struct work_struct *work);
void pblk_pipeline_stop(struct pblk *pblk);
void pblk_line_mark_bb(struct work_struct *work);
void pblk_line_run_ws(struct pblk *pblk, struct pblk_line *line, void *priv,
void (*work)(struct work_struct *),
struct workqueue_struct *wq);
void pblk_gen_run_ws(struct pblk *pblk, struct pblk_line *line, void *priv,
void (*work)(struct work_struct *), gfp_t gfp_mask,
struct workqueue_struct *wq);
u64 pblk_line_smeta_start(struct pblk *pblk, struct pblk_line *line);
int pblk_line_read_smeta(struct pblk *pblk, struct pblk_line *line);
int pblk_line_read_emeta(struct pblk *pblk, struct pblk_line *line,
void *emeta_buf);
int pblk_blk_erase_async(struct pblk *pblk, struct ppa_addr erase_ppa);
void pblk_line_put(struct kref *ref);
void pblk_line_put_wq(struct kref *ref);
struct list_head *pblk_line_gc_list(struct pblk *pblk, struct pblk_line *line);
u64 pblk_lookup_page(struct pblk *pblk, struct pblk_line *line);
void pblk_dealloc_page(struct pblk *pblk, struct pblk_line *line, int nr_secs);
@ -745,7 +756,6 @@ void pblk_down_rq(struct pblk *pblk, struct ppa_addr *ppa_list, int nr_ppas,
void pblk_down_page(struct pblk *pblk, struct ppa_addr *ppa_list, int nr_ppas);
void pblk_up_rq(struct pblk *pblk, struct ppa_addr *ppa_list, int nr_ppas,
unsigned long *lun_bitmap);
void pblk_end_bio_sync(struct bio *bio);
void pblk_end_io_sync(struct nvm_rq *rqd);
int pblk_bio_add_pages(struct pblk *pblk, struct bio *bio, gfp_t flags,
int nr_pages);
@ -760,7 +770,7 @@ void pblk_update_map_cache(struct pblk *pblk, sector_t lba,
void pblk_update_map_dev(struct pblk *pblk, sector_t lba,
struct ppa_addr ppa, struct ppa_addr entry_line);
int pblk_update_map_gc(struct pblk *pblk, sector_t lba, struct ppa_addr ppa,
struct pblk_line *gc_line);
struct pblk_line *gc_line, u64 paddr);
void pblk_lookup_l2p_rand(struct pblk *pblk, struct ppa_addr *ppas,
u64 *lba_list, int nr_secs);
void pblk_lookup_l2p_seq(struct pblk *pblk, struct ppa_addr *ppas,
@ -771,9 +781,7 @@ void pblk_lookup_l2p_seq(struct pblk *pblk, struct ppa_addr *ppas,
*/
int pblk_write_to_cache(struct pblk *pblk, struct bio *bio,
unsigned long flags);
int pblk_write_gc_to_cache(struct pblk *pblk, void *data, u64 *lba_list,
unsigned int nr_entries, unsigned int nr_rec_entries,
struct pblk_line *gc_line, unsigned long flags);
int pblk_write_gc_to_cache(struct pblk *pblk, struct pblk_gc_rq *gc_rq);
/*
* pblk map
@ -797,9 +805,7 @@ void pblk_write_should_kick(struct pblk *pblk);
*/
extern struct bio_set *pblk_bio_set;
int pblk_submit_read(struct pblk *pblk, struct bio *bio);
int pblk_submit_read_gc(struct pblk *pblk, u64 *lba_list, void *data,
unsigned int nr_secs, unsigned int *secs_to_gc,
struct pblk_line *line);
int pblk_submit_read_gc(struct pblk *pblk, struct pblk_gc_rq *gc_rq);
/*
* pblk recovery
*/
@ -815,7 +821,7 @@ int pblk_recov_setup_rq(struct pblk *pblk, struct pblk_c_ctx *c_ctx,
* pblk gc
*/
#define PBLK_GC_MAX_READERS 8 /* Max number of outstanding GC reader jobs */
#define PBLK_GC_W_QD 128 /* Queue depth for inflight GC write I/Os */
#define PBLK_GC_RQ_QD 128 /* Queue depth for inflight GC requests */
#define PBLK_GC_L_QD 4 /* Queue depth for inflight GC lines */
#define PBLK_GC_RSV_LINE 1 /* Reserved lines for GC */
@ -824,7 +830,7 @@ void pblk_gc_exit(struct pblk *pblk);
void pblk_gc_should_start(struct pblk *pblk);
void pblk_gc_should_stop(struct pblk *pblk);
void pblk_gc_should_kick(struct pblk *pblk);
void pblk_gc_kick(struct pblk *pblk);
void pblk_gc_free_full_lines(struct pblk *pblk);
void pblk_gc_sysfs_state_show(struct pblk *pblk, int *gc_enabled,
int *gc_active);
int pblk_gc_sysfs_force(struct pblk *pblk, int force);
@ -834,8 +840,8 @@ int pblk_gc_sysfs_force(struct pblk *pblk, int force);
*/
void pblk_rl_init(struct pblk_rl *rl, int budget);
void pblk_rl_free(struct pblk_rl *rl);
void pblk_rl_update_rates(struct pblk_rl *rl);
int pblk_rl_high_thrs(struct pblk_rl *rl);
int pblk_rl_low_thrs(struct pblk_rl *rl);
unsigned long pblk_rl_nr_free_blks(struct pblk_rl *rl);
int pblk_rl_user_may_insert(struct pblk_rl *rl, int nr_entries);
void pblk_rl_inserted(struct pblk_rl *rl, int nr_entries);
@ -843,10 +849,9 @@ void pblk_rl_user_in(struct pblk_rl *rl, int nr_entries);
int pblk_rl_gc_may_insert(struct pblk_rl *rl, int nr_entries);
void pblk_rl_gc_in(struct pblk_rl *rl, int nr_entries);
void pblk_rl_out(struct pblk_rl *rl, int nr_user, int nr_gc);
int pblk_rl_sysfs_rate_show(struct pblk_rl *rl);
int pblk_rl_max_io(struct pblk_rl *rl);
void pblk_rl_free_lines_inc(struct pblk_rl *rl, struct pblk_line *line);
void pblk_rl_free_lines_dec(struct pblk_rl *rl, struct pblk_line *line);
void pblk_rl_set_space_limit(struct pblk_rl *rl, int entries_left);
int pblk_rl_is_limit(struct pblk_rl *rl);
/*
@ -892,13 +897,7 @@ static inline void *emeta_to_vsc(struct pblk *pblk, struct line_emeta *emeta)
static inline int pblk_line_vsc(struct pblk_line *line)
{
int vsc;
spin_lock(&line->lock);
vsc = le32_to_cpu(*line->vsc);
spin_unlock(&line->lock);
return vsc;
return le32_to_cpu(*line->vsc);
}
#define NVM_MEM_PAGE_WRITE (8)
@ -1140,7 +1139,7 @@ static inline int pblk_set_progr_mode(struct pblk *pblk, int type)
flags = geo->plane_mode >> 1;
if (type == WRITE)
if (type == PBLK_WRITE)
flags |= NVM_IO_SCRAMBLE_ENABLE;
return flags;
@ -1200,7 +1199,6 @@ static inline void pblk_print_failed_rqd(struct pblk *pblk, struct nvm_rq *rqd,
pr_err("error:%d, ppa_status:%llx\n", error, rqd->ppa_status);
}
#endif
static inline int pblk_boundary_ppa_checks(struct nvm_tgt_dev *tgt_dev,
struct ppa_addr *ppas, int nr_ppas)
@ -1221,14 +1219,50 @@ static inline int pblk_boundary_ppa_checks(struct nvm_tgt_dev *tgt_dev,
ppa->g.sec < geo->sec_per_pg)
continue;
#ifdef CONFIG_NVM_DEBUG
print_ppa(ppa, "boundary", i);
#endif
return 1;
}
return 0;
}
static inline int pblk_check_io(struct pblk *pblk, struct nvm_rq *rqd)
{
struct nvm_tgt_dev *dev = pblk->dev;
struct ppa_addr *ppa_list;
ppa_list = (rqd->nr_ppas > 1) ? rqd->ppa_list : &rqd->ppa_addr;
if (pblk_boundary_ppa_checks(dev, ppa_list, rqd->nr_ppas)) {
WARN_ON(1);
return -EINVAL;
}
if (rqd->opcode == NVM_OP_PWRITE) {
struct pblk_line *line;
struct ppa_addr ppa;
int i;
for (i = 0; i < rqd->nr_ppas; i++) {
ppa = ppa_list[i];
line = &pblk->lines[pblk_dev_ppa_to_line(ppa)];
spin_lock(&line->lock);
if (line->state != PBLK_LINESTATE_OPEN) {
pr_err("pblk: bad ppa: line:%d,state:%d\n",
line->id, line->state);
WARN_ON(1);
spin_unlock(&line->lock);
return -EINVAL;
}
spin_unlock(&line->lock);
}
}
return 0;
}
#endif
static inline int pblk_boundary_paddr_checks(struct pblk *pblk, u64 paddr)
{
struct pblk_line_meta *lm = &pblk->lm;


@ -407,7 +407,8 @@ long bch_bucket_alloc(struct cache *ca, unsigned reserve, bool wait)
finish_wait(&ca->set->bucket_wait, &w);
out:
wake_up_process(ca->alloc_thread);
if (ca->alloc_thread)
wake_up_process(ca->alloc_thread);
trace_bcache_alloc(ca, reserve);
@ -442,6 +443,11 @@ out:
b->prio = INITIAL_PRIO;
}
if (ca->set->avail_nbuckets > 0) {
ca->set->avail_nbuckets--;
bch_update_bucket_in_use(ca->set, &ca->set->gc_stats);
}
return r;
}
@ -449,6 +455,11 @@ void __bch_bucket_free(struct cache *ca, struct bucket *b)
{
SET_GC_MARK(b, 0);
SET_GC_SECTORS_USED(b, 0);
if (ca->set->avail_nbuckets < ca->set->nbuckets) {
ca->set->avail_nbuckets++;
bch_update_bucket_in_use(ca->set, &ca->set->gc_stats);
}
}
void bch_bucket_free(struct cache_set *c, struct bkey *k)
@ -601,7 +612,7 @@ bool bch_alloc_sectors(struct cache_set *c, struct bkey *k, unsigned sectors,
/*
* If we had to allocate, we might race and not need to allocate the
* second time we call find_data_bucket(). If we allocated a bucket but
* second time we call pick_data_bucket(). If we allocated a bucket but
* didn't use it, drop the refcount bch_bucket_alloc_set() took:
*/
if (KEY_PTRS(&alloc.key))


@ -185,6 +185,7 @@
#include <linux/mutex.h>
#include <linux/rbtree.h>
#include <linux/rwsem.h>
#include <linux/refcount.h>
#include <linux/types.h>
#include <linux/workqueue.h>
@ -266,9 +267,6 @@ struct bcache_device {
atomic_t *stripe_sectors_dirty;
unsigned long *full_dirty_stripes;
unsigned long sectors_dirty_last;
long sectors_dirty_derivative;
struct bio_set *bio_split;
unsigned data_csum:1;
@ -300,7 +298,7 @@ struct cached_dev {
struct semaphore sb_write_mutex;
/* Refcount on the cache set. Always nonzero when we're caching. */
atomic_t count;
refcount_t count;
struct work_struct detach;
/*
@ -363,12 +361,14 @@ struct cached_dev {
uint64_t writeback_rate_target;
int64_t writeback_rate_proportional;
int64_t writeback_rate_derivative;
int64_t writeback_rate_change;
int64_t writeback_rate_integral;
int64_t writeback_rate_integral_scaled;
int32_t writeback_rate_change;
unsigned writeback_rate_update_seconds;
unsigned writeback_rate_d_term;
unsigned writeback_rate_i_term_inverse;
unsigned writeback_rate_p_term_inverse;
unsigned writeback_rate_minimum;
};
enum alloc_reserve {
@ -582,6 +582,7 @@ struct cache_set {
uint8_t need_gc;
struct gc_stat gc_stats;
size_t nbuckets;
size_t avail_nbuckets;
struct task_struct *gc_thread;
/* Where in the btree gc currently is */
@ -807,13 +808,13 @@ do { \
static inline void cached_dev_put(struct cached_dev *dc)
{
if (atomic_dec_and_test(&dc->count))
if (refcount_dec_and_test(&dc->count))
schedule_work(&dc->detach);
}
static inline bool cached_dev_get(struct cached_dev *dc)
{
if (!atomic_inc_not_zero(&dc->count))
if (!refcount_inc_not_zero(&dc->count))
return false;
/* Paired with the mb in cached_dev_attach */


@ -1241,6 +1241,11 @@ void bch_initial_mark_key(struct cache_set *c, int level, struct bkey *k)
__bch_btree_mark_key(c, level, k);
}
void bch_update_bucket_in_use(struct cache_set *c, struct gc_stat *stats)
{
stats->in_use = (c->nbuckets - c->avail_nbuckets) * 100 / c->nbuckets;
}
static bool btree_gc_mark_node(struct btree *b, struct gc_stat *gc)
{
uint8_t stale = 0;
@ -1652,9 +1657,8 @@ static void btree_gc_start(struct cache_set *c)
mutex_unlock(&c->bucket_lock);
}
static size_t bch_btree_gc_finish(struct cache_set *c)
static void bch_btree_gc_finish(struct cache_set *c)
{
size_t available = 0;
struct bucket *b;
struct cache *ca;
unsigned i;
@ -1691,6 +1695,7 @@ static size_t bch_btree_gc_finish(struct cache_set *c)
}
rcu_read_unlock();
c->avail_nbuckets = 0;
for_each_cache(ca, c, i) {
uint64_t *i;
@ -1712,18 +1717,16 @@ static size_t bch_btree_gc_finish(struct cache_set *c)
BUG_ON(!GC_MARK(b) && GC_SECTORS_USED(b));
if (!GC_MARK(b) || GC_MARK(b) == GC_MARK_RECLAIMABLE)
available++;
c->avail_nbuckets++;
}
}
mutex_unlock(&c->bucket_lock);
return available;
}
static void bch_btree_gc(struct cache_set *c)
{
int ret;
unsigned long available;
struct gc_stat stats;
struct closure writes;
struct btree_op op;
@ -1746,14 +1749,14 @@ static void bch_btree_gc(struct cache_set *c)
pr_warn("gc failed!");
} while (ret);
available = bch_btree_gc_finish(c);
bch_btree_gc_finish(c);
wake_up_allocators(c);
bch_time_stats_update(&c->btree_gc_time, start_time);
stats.key_bytes *= sizeof(uint64_t);
stats.data <<= 9;
stats.in_use = (c->nbuckets - available) * 100 / c->nbuckets;
bch_update_bucket_in_use(c, &stats);
memcpy(&c->gc_stats, &stats, sizeof(struct gc_stat));
trace_bcache_gc_end(c);
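
The bucket accounting added in the hunks above keeps avail_nbuckets current on every allocation and free, and recomputes in_use through bch_update_bucket_in_use(). It can be sketched in userspace roughly as follows. This is only an illustration: struct bucket_counts and the three helpers are hypothetical stand-ins for the cache_set fields and the patched functions, and the locking the kernel code relies on is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two cache_set counters used by the patch. */
struct bucket_counts {
	size_t nbuckets;       /* total buckets in the cache set */
	size_t avail_nbuckets; /* buckets currently available */
};

/* Same arithmetic as bch_update_bucket_in_use(): percent of buckets in use. */
static unsigned int in_use_percent(const struct bucket_counts *c)
{
	assert(c->nbuckets > 0);
	assert(c->avail_nbuckets <= c->nbuckets);
	return (unsigned int)((c->nbuckets - c->avail_nbuckets) * 100 /
			      c->nbuckets);
}

/* Mirrors the guarded decrement added to the allocation path. */
static void bucket_alloc(struct bucket_counts *c)
{
	if (c->avail_nbuckets > 0)
		c->avail_nbuckets--;
}

/* Mirrors the guarded increment added to __bch_bucket_free(). */
static void bucket_free(struct bucket_counts *c)
{
	if (c->avail_nbuckets < c->nbuckets)
		c->avail_nbuckets++;
}
```

The guards on both helpers keep the counter inside [0, nbuckets], so the percentage stays within 0..100 even if the counter drifts between garbage collection runs.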


@ -306,5 +306,5 @@ void bch_keybuf_del(struct keybuf *, struct keybuf_key *);
struct keybuf_key *bch_keybuf_next(struct keybuf *);
struct keybuf_key *bch_keybuf_next_rescan(struct cache_set *, struct keybuf *,
struct bkey *, keybuf_pred_fn *);
void bch_update_bucket_in_use(struct cache_set *c, struct gc_stat *stats);
#endif


@ -252,6 +252,12 @@ static inline void set_closure_fn(struct closure *cl, closure_fn *fn,
static inline void closure_queue(struct closure *cl)
{
struct workqueue_struct *wq = cl->wq;
/**
* Changes made to closure, work_struct, or a couple of other structs
* may cause work.func not pointing to the right location.
*/
BUILD_BUG_ON(offsetof(struct closure, fn)
!= offsetof(struct work_struct, func));
if (wq) {
INIT_WORK(&cl->work, cl->work.func);
BUG_ON(!queue_work(wq, &cl->work));


@ -27,12 +27,12 @@ struct kmem_cache *bch_search_cache;
static void bch_data_insert_start(struct closure *);
static unsigned cache_mode(struct cached_dev *dc, struct bio *bio)
static unsigned cache_mode(struct cached_dev *dc)
{
return BDEV_CACHE_MODE(&dc->sb);
}
static bool verify(struct cached_dev *dc, struct bio *bio)
static bool verify(struct cached_dev *dc)
{
return dc->verify;
}
@ -370,7 +370,7 @@ static struct hlist_head *iohash(struct cached_dev *dc, uint64_t k)
static bool check_should_bypass(struct cached_dev *dc, struct bio *bio)
{
struct cache_set *c = dc->disk.c;
unsigned mode = cache_mode(dc, bio);
unsigned mode = cache_mode(dc);
unsigned sectors, congested = bch_get_congested(c);
struct task_struct *task = current;
struct io *i;
@ -385,6 +385,14 @@ static bool check_should_bypass(struct cached_dev *dc, struct bio *bio)
op_is_write(bio_op(bio))))
goto skip;
/*
* Flag for bypass if the IO is for read-ahead or background,
* unless the read-ahead request is for metadata (eg, for gfs2).
*/
if (bio->bi_opf & (REQ_RAHEAD|REQ_BACKGROUND) &&
!(bio->bi_opf & REQ_META))
goto skip;
if (bio->bi_iter.bi_sector & (c->sb.block_size - 1) ||
bio_sectors(bio) & (c->sb.block_size - 1)) {
pr_debug("skipping unaligned io");
@ -463,6 +471,7 @@ struct search {
unsigned recoverable:1;
unsigned write:1;
unsigned read_dirty_data:1;
unsigned cache_missed:1;
unsigned long start_time;
@ -649,6 +658,7 @@ static inline struct search *search_alloc(struct bio *bio,
s->orig_bio = bio;
s->cache_miss = NULL;
s->cache_missed = 0;
s->d = d;
s->recoverable = 1;
s->write = op_is_write(bio_op(bio));
@ -698,8 +708,16 @@ static void cached_dev_read_error(struct closure *cl)
{
struct search *s = container_of(cl, struct search, cl);
struct bio *bio = &s->bio.bio;
struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
if (s->recoverable) {
/*
* If cache device is dirty (dc->has_dirty is non-zero), then
* recovery a failed read request from cached device may get a
* stale data back. So read failure recovery is only permitted
* when cache device is clean.
*/
if (s->recoverable &&
(dc && !atomic_read(&dc->has_dirty))) {
/* Retry from the backing device: */
trace_bcache_read_retry(s->orig_bio);
@ -740,7 +758,7 @@ static void cached_dev_read_done(struct closure *cl)
s->cache_miss = NULL;
}
if (verify(dc, &s->bio.bio) && s->recoverable && !s->read_dirty_data)
if (verify(dc) && s->recoverable && !s->read_dirty_data)
bch_data_verify(dc, s->orig_bio);
bio_complete(s);
@ -760,12 +778,12 @@ static void cached_dev_read_done_bh(struct closure *cl)
struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
bch_mark_cache_accounting(s->iop.c, s->d,
!s->cache_miss, s->iop.bypass);
!s->cache_missed, s->iop.bypass);
trace_bcache_read(s->orig_bio, !s->cache_miss, s->iop.bypass);
if (s->iop.status)
continue_at_nobarrier(cl, cached_dev_read_error, bcache_wq);
else if (s->iop.bio || verify(dc, &s->bio.bio))
else if (s->iop.bio || verify(dc))
continue_at_nobarrier(cl, cached_dev_read_done, bcache_wq);
else
continue_at_nobarrier(cl, cached_dev_bio_complete, NULL);
@ -779,6 +797,8 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
struct bio *miss, *cache_bio;
s->cache_missed = 1;
if (s->cache_miss || s->iop.bypass) {
miss = bio_next_split(bio, sectors, GFP_NOIO, s->d->bio_split);
ret = miss == bio ? MAP_DONE : MAP_CONTINUE;
@ -892,7 +912,7 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)
s->iop.bypass = true;
if (should_writeback(dc, s->orig_bio,
cache_mode(dc, bio),
cache_mode(dc),
s->iop.bypass)) {
s->iop.bypass = false;
s->iop.writeback = true;


@ -53,12 +53,15 @@ LIST_HEAD(bch_cache_sets);
static LIST_HEAD(uncached_devices);
static int bcache_major;
static DEFINE_IDA(bcache_minor);
static DEFINE_IDA(bcache_device_idx);
static wait_queue_head_t unregister_wait;
struct workqueue_struct *bcache_wq;
#define BTREE_MAX_PAGES (256 * 1024 / PAGE_SIZE)
#define BCACHE_MINORS 16 /* partition support */
/* limitation of partitions number on single bcache device */
#define BCACHE_MINORS 128
/* limitation of bcache devices number on single system */
#define BCACHE_DEVICE_IDX_MAX ((1U << MINORBITS)/BCACHE_MINORS)
/* Superblock */
@ -721,6 +724,16 @@ static void bcache_device_attach(struct bcache_device *d, struct cache_set *c,
closure_get(&c->caching);
}
static inline int first_minor_to_idx(int first_minor)
{
return (first_minor/BCACHE_MINORS);
}
static inline int idx_to_first_minor(int idx)
{
return (idx * BCACHE_MINORS);
}
static void bcache_device_free(struct bcache_device *d)
{
lockdep_assert_held(&bch_register_lock);
@ -734,7 +747,8 @@ static void bcache_device_free(struct bcache_device *d)
if (d->disk && d->disk->queue)
blk_cleanup_queue(d->disk->queue);
if (d->disk) {
ida_simple_remove(&bcache_minor, d->disk->first_minor);
ida_simple_remove(&bcache_device_idx,
first_minor_to_idx(d->disk->first_minor));
put_disk(d->disk);
}
@ -751,7 +765,7 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
{
struct request_queue *q;
size_t n;
int minor;
int idx;
if (!d->stripe_size)
d->stripe_size = 1 << 31;
@ -776,25 +790,24 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
if (!d->full_dirty_stripes)
return -ENOMEM;
minor = ida_simple_get(&bcache_minor, 0, MINORMASK + 1, GFP_KERNEL);
if (minor < 0)
return minor;
minor *= BCACHE_MINORS;
idx = ida_simple_get(&bcache_device_idx, 0,
BCACHE_DEVICE_IDX_MAX, GFP_KERNEL);
if (idx < 0)
return idx;
if (!(d->bio_split = bioset_create(4, offsetof(struct bbio, bio),
BIOSET_NEED_BVECS |
BIOSET_NEED_RESCUER)) ||
!(d->disk = alloc_disk(BCACHE_MINORS))) {
ida_simple_remove(&bcache_minor, minor);
ida_simple_remove(&bcache_device_idx, idx);
return -ENOMEM;
}
set_capacity(d->disk, sectors);
snprintf(d->disk->disk_name, DISK_NAME_LEN, "bcache%i", minor);
snprintf(d->disk->disk_name, DISK_NAME_LEN, "bcache%i", idx);
d->disk->major = bcache_major;
d->disk->first_minor = minor;
d->disk->first_minor = idx_to_first_minor(idx);
d->disk->fops = &bcache_ops;
d->disk->private_data = d;
@ -889,7 +902,7 @@ static void cached_dev_detach_finish(struct work_struct *w)
closure_init_stack(&cl);
BUG_ON(!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags));
BUG_ON(atomic_read(&dc->count));
BUG_ON(refcount_read(&dc->count));
mutex_lock(&bch_register_lock);
@ -1016,7 +1029,7 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c)
* dc->c must be set before dc->count != 0 - paired with the mb in
* cached_dev_get()
*/
atomic_set(&dc->count, 1);
refcount_set(&dc->count, 1);
/* Block writeback thread, but spawn it */
down_write(&dc->writeback_lock);
@ -1028,7 +1041,7 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c)
if (BDEV_STATE(&dc->sb) == BDEV_STATE_DIRTY) {
bch_sectors_dirty_init(&dc->disk);
atomic_set(&dc->has_dirty, 1);
atomic_inc(&dc->count);
refcount_inc(&dc->count);
bch_writeback_queue(dc);
}
@ -1129,9 +1142,6 @@ static int cached_dev_init(struct cached_dev *dc, unsigned block_size)
if (ret)
return ret;
set_capacity(dc->disk.disk,
dc->bdev->bd_part->nr_sects - dc->sb.data_offset);
dc->disk.disk->queue->backing_dev_info->ra_pages =
max(dc->disk.disk->queue->backing_dev_info->ra_pages,
q->backing_dev_info->ra_pages);
@ -2085,6 +2095,7 @@ static void bcache_exit(void)
if (bcache_major)
unregister_blkdev(bcache_major, "bcache");
unregister_reboot_notifier(&reboot);
mutex_destroy(&bch_register_lock);
}
static int __init bcache_init(void)
@ -2103,14 +2114,15 @@ static int __init bcache_init(void)
bcache_major = register_blkdev(0, "bcache");
if (bcache_major < 0) {
unregister_reboot_notifier(&reboot);
mutex_destroy(&bch_register_lock);
return bcache_major;
}
if (!(bcache_wq = alloc_workqueue("bcache", WQ_MEM_RECLAIM, 0)) ||
!(bcache_kobj = kobject_create_and_add("bcache", fs_kobj)) ||
sysfs_create_files(bcache_kobj, files) ||
bch_request_init() ||
bch_debug_init(bcache_kobj))
bch_debug_init(bcache_kobj) ||
sysfs_create_files(bcache_kobj, files))
goto err;
return 0;
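
The device index to minor number mapping introduced above can be checked with a small userspace sketch. The two helpers are copied from the patch. MINORBITS is defined locally as 20 to match the kernel's value, which is an assumption restated here so that the example compiles on its own.

```c
#include <assert.h>

#define MINORBITS 20 /* assumed to match the kernel's definition */

/* limitation of partitions number on single bcache device */
#define BCACHE_MINORS 128
/* limitation of bcache devices number on single system */
#define BCACHE_DEVICE_IDX_MAX ((1U << MINORBITS) / BCACHE_MINORS)

/* Any minor inside a device's range maps back to that device's index. */
static inline int first_minor_to_idx(int first_minor)
{
	return first_minor / BCACHE_MINORS;
}

/* Each device index owns a contiguous block of BCACHE_MINORS minors. */
static inline int idx_to_first_minor(int idx)
{
	return idx * BCACHE_MINORS;
}
```

With these values, index 3 (named "bcache3" by the snprintf() call in the patch) owns minors 384 through 511, and at most 8192 indexes fit in the minor number space.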


@ -82,8 +82,9 @@ rw_attribute(writeback_delay);
rw_attribute(writeback_rate);
rw_attribute(writeback_rate_update_seconds);
rw_attribute(writeback_rate_d_term);
rw_attribute(writeback_rate_i_term_inverse);
rw_attribute(writeback_rate_p_term_inverse);
rw_attribute(writeback_rate_minimum);
read_attribute(writeback_rate_debug);
read_attribute(stripe_size);
@ -131,15 +132,16 @@ SHOW(__bch_cached_dev)
sysfs_hprint(writeback_rate, dc->writeback_rate.rate << 9);
var_print(writeback_rate_update_seconds);
var_print(writeback_rate_d_term);
var_print(writeback_rate_i_term_inverse);
var_print(writeback_rate_p_term_inverse);
var_print(writeback_rate_minimum);
if (attr == &sysfs_writeback_rate_debug) {
char rate[20];
char dirty[20];
char target[20];
char proportional[20];
char derivative[20];
char integral[20];
char change[20];
s64 next_io;
@ -147,7 +149,7 @@ SHOW(__bch_cached_dev)
bch_hprint(dirty, bcache_dev_sectors_dirty(&dc->disk) << 9);
bch_hprint(target, dc->writeback_rate_target << 9);
bch_hprint(proportional,dc->writeback_rate_proportional << 9);
bch_hprint(derivative, dc->writeback_rate_derivative << 9);
bch_hprint(integral, dc->writeback_rate_integral_scaled << 9);
bch_hprint(change, dc->writeback_rate_change << 9);
next_io = div64_s64(dc->writeback_rate.next - local_clock(),
@ -158,11 +160,11 @@ SHOW(__bch_cached_dev)
"dirty:\t\t%s\n"
"target:\t\t%s\n"
"proportional:\t%s\n"
"derivative:\t%s\n"
"integral:\t%s\n"
"change:\t\t%s/sec\n"
"next io:\t%llims\n",
rate, dirty, target, proportional,
derivative, change, next_io);
integral, change, next_io);
}
sysfs_hprint(dirty_data,
@ -214,7 +216,7 @@ STORE(__cached_dev)
dc->writeback_rate.rate, 1, INT_MAX);
d_strtoul_nonzero(writeback_rate_update_seconds);
d_strtoul(writeback_rate_d_term);
d_strtoul(writeback_rate_i_term_inverse);
d_strtoul_nonzero(writeback_rate_p_term_inverse);
d_strtoi_h(sequential_cutoff);
@ -320,7 +322,7 @@ static struct attribute *bch_cached_dev_files[] = {
&sysfs_writeback_percent,
&sysfs_writeback_rate,
&sysfs_writeback_rate_update_seconds,
&sysfs_writeback_rate_d_term,
&sysfs_writeback_rate_i_term_inverse,
&sysfs_writeback_rate_p_term_inverse,
&sysfs_writeback_rate_debug,
&sysfs_dirty_data,
@ -746,6 +748,11 @@ static struct attribute *bch_cache_set_internal_files[] = {
};
KTYPE(bch_cache_set_internal);
static int __bch_cache_cmp(const void *l, const void *r)
{
return *((uint16_t *)r) - *((uint16_t *)l);
}
SHOW(__bch_cache)
{
struct cache *ca = container_of(kobj, struct cache, kobj);
@ -770,9 +777,6 @@ SHOW(__bch_cache)
CACHE_REPLACEMENT(&ca->sb));
if (attr == &sysfs_priority_stats) {
int cmp(const void *l, const void *r)
{ return *((uint16_t *) r) - *((uint16_t *) l); }
struct bucket *b;
size_t n = ca->sb.nbuckets, i;
size_t unused = 0, available = 0, dirty = 0, meta = 0;
@ -801,7 +805,7 @@ SHOW(__bch_cache)
p[i] = ca->buckets[i].prio;
mutex_unlock(&ca->set->bucket_lock);
sort(p, n, sizeof(uint16_t), cmp, NULL);
sort(p, n, sizeof(uint16_t), __bch_cache_cmp, NULL);
while (n &&
!cached[n - 1])


@ -232,8 +232,14 @@ uint64_t bch_next_delay(struct bch_ratelimit *d, uint64_t done)
d->next += div_u64(done * NSEC_PER_SEC, d->rate);
if (time_before64(now + NSEC_PER_SEC, d->next))
d->next = now + NSEC_PER_SEC;
/* Bound the time. Don't let us fall further than 2 seconds behind
* (this prevents unnecessary backlog that would make it impossible
* to catch up). If we're ahead of the desired writeback rate,
* don't let us sleep more than 2.5 seconds (so we can notice/respond
* if the control system tells us to speed up!).
*/
if (time_before64(now + NSEC_PER_SEC * 5LLU / 2LLU, d->next))
d->next = now + NSEC_PER_SEC * 5LLU / 2LLU;
if (time_after64(now - NSEC_PER_SEC * 2, d->next))
d->next = now - NSEC_PER_SEC * 2;
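
The new bounds in bch_next_delay() keep d->next within a window of 2 seconds behind and 2.5 seconds ahead of the current time. A minimal userspace sketch of just that clamping step follows. The names before64(), after64(), and bound_next() are hypothetical; the first two are assumed to behave like the kernel's time_before64() and time_after64(), implemented here with a signed difference so that they tolerate counter wraparound.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* True if a is earlier than b, using a wrap-tolerant signed difference. */
static int before64(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/* True if a is later than b. */
static int after64(uint64_t a, uint64_t b)
{
	return before64(b, a);
}

/* Clamp next into [now - 2s, now + 2.5s], as the patched code does. */
static uint64_t bound_next(uint64_t now, uint64_t next)
{
	if (before64(now + NSEC_PER_SEC * 5 / 2, next))
		next = now + NSEC_PER_SEC * 5 / 2;

	if (after64(now - NSEC_PER_SEC * 2, next))
		next = now - NSEC_PER_SEC * 2;

	return next;
}
```

A value already inside the window is returned unchanged, so the rate limiter only loses history when it is far ahead or far behind.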


@ -442,10 +442,10 @@ struct bch_ratelimit {
uint64_t next;
/*
* Rate at which we want to do work, in units per nanosecond
* Rate at which we want to do work, in units per second
* The units here correspond to the units passed to bch_next_delay()
*/
unsigned rate;
uint32_t rate;
};
static inline void bch_ratelimit_reset(struct bch_ratelimit *d)


@ -26,48 +26,63 @@ static void __update_writeback_rate(struct cached_dev *dc)
bcache_flash_devs_sectors_dirty(c);
uint64_t cache_dirty_target =
div_u64(cache_sectors * dc->writeback_percent, 100);
int64_t target = div64_u64(cache_dirty_target * bdev_sectors(dc->bdev),
c->cached_dev_sectors);
/* PD controller */
/*
* PI controller:
* Figures out the amount that should be written per second.
*
* First, the error (number of sectors that are dirty beyond our
* target) is calculated. The error is accumulated (numerically
* integrated).
*
* Then, the proportional value and integral value are scaled
* based on configured values. These are stored as inverses to
* avoid fixed point math and to make configuration easy-- e.g.
* the default value of 40 for writeback_rate_p_term_inverse
* attempts to write at a rate that would retire all the dirty
* blocks in 40 seconds.
*
* The writeback_rate_i_inverse value of 10000 means that 1/10000th
* of the error is accumulated in the integral term per second.
* This acts as a slow, long-term average that is not subject to
* variations in usage like the p term.
*/
int64_t dirty = bcache_dev_sectors_dirty(&dc->disk);
int64_t derivative = dirty - dc->disk.sectors_dirty_last;
int64_t proportional = dirty - target;
int64_t change;
int64_t error = dirty - target;
int64_t proportional_scaled =
div_s64(error, dc->writeback_rate_p_term_inverse);
int64_t integral_scaled;
uint32_t new_rate;
dc->disk.sectors_dirty_last = dirty;
if ((error < 0 && dc->writeback_rate_integral > 0) ||
(error > 0 && time_before64(local_clock(),
dc->writeback_rate.next + NSEC_PER_MSEC))) {
/*
* Only decrease the integral term if it's more than
* zero. Only increase the integral term if the device
* is keeping up. (Don't wind up the integral
* ineffectively in either case).
*
* It's necessary to scale this by
* writeback_rate_update_seconds to keep the integral
* term dimensioned properly.
*/
dc->writeback_rate_integral += error *
dc->writeback_rate_update_seconds;
}
/* Scale to sectors per second */
integral_scaled = div_s64(dc->writeback_rate_integral,
dc->writeback_rate_i_term_inverse);
proportional *= dc->writeback_rate_update_seconds;
proportional = div_s64(proportional, dc->writeback_rate_p_term_inverse);
new_rate = clamp_t(int32_t, (proportional_scaled + integral_scaled),
dc->writeback_rate_minimum, NSEC_PER_SEC);
derivative = div_s64(derivative, dc->writeback_rate_update_seconds);
derivative = ewma_add(dc->disk.sectors_dirty_derivative, derivative,
(dc->writeback_rate_d_term /
dc->writeback_rate_update_seconds) ?: 1, 0);
derivative *= dc->writeback_rate_d_term;
derivative = div_s64(derivative, dc->writeback_rate_p_term_inverse);
change = proportional + derivative;
/* Don't increase writeback rate if the device isn't keeping up */
if (change > 0 &&
time_after64(local_clock(),
dc->writeback_rate.next + NSEC_PER_MSEC))
change = 0;
dc->writeback_rate.rate =
clamp_t(int64_t, (int64_t) dc->writeback_rate.rate + change,
1, NSEC_PER_MSEC);
dc->writeback_rate_proportional = proportional;
dc->writeback_rate_derivative = derivative;
dc->writeback_rate_change = change;
dc->writeback_rate_proportional = proportional_scaled;
dc->writeback_rate_integral_scaled = integral_scaled;
dc->writeback_rate_change = new_rate - dc->writeback_rate.rate;
dc->writeback_rate.rate = new_rate;
dc->writeback_rate_target = target;
}
@ -180,13 +195,21 @@ static void write_dirty(struct closure *cl)
struct dirty_io *io = container_of(cl, struct dirty_io, cl);
struct keybuf_key *w = io->bio.bi_private;
dirty_init(w);
bio_set_op_attrs(&io->bio, REQ_OP_WRITE, 0);
io->bio.bi_iter.bi_sector = KEY_START(&w->key);
bio_set_dev(&io->bio, io->dc->bdev);
io->bio.bi_end_io = dirty_endio;
/*
* IO errors are signalled using the dirty bit on the key.
* If we failed to read, we should not attempt to write to the
* backing device. Instead, immediately go to write_dirty_finish
* to clean up.
*/
if (KEY_DIRTY(&w->key)) {
dirty_init(w);
bio_set_op_attrs(&io->bio, REQ_OP_WRITE, 0);
io->bio.bi_iter.bi_sector = KEY_START(&w->key);
bio_set_dev(&io->bio, io->dc->bdev);
io->bio.bi_end_io = dirty_endio;
closure_bio_submit(&io->bio, cl);
closure_bio_submit(&io->bio, cl);
}
continue_at(cl, write_dirty_finish, io->dc->writeback_write_wq);
}
@ -418,6 +441,8 @@ static int bch_writeback_thread(void *arg)
struct cached_dev *dc = arg;
bool searched_full_index;
bch_ratelimit_reset(&dc->writeback_rate);
while (!kthread_should_stop()) {
down_write(&dc->writeback_lock);
if (!atomic_read(&dc->has_dirty) ||
@ -445,7 +470,6 @@ static int bch_writeback_thread(void *arg)
up_write(&dc->writeback_lock);
bch_ratelimit_reset(&dc->writeback_rate);
read_dirty(dc);
if (searched_full_index) {
@ -455,6 +479,8 @@ static int bch_writeback_thread(void *arg)
!kthread_should_stop() &&
!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags))
delay = schedule_timeout_interruptible(delay);
bch_ratelimit_reset(&dc->writeback_rate);
}
}
@ -492,8 +518,6 @@ void bch_sectors_dirty_init(struct bcache_device *d)
bch_btree_map_keys(&op.op, d->c, &KEY(op.inode, 0, 0),
sectors_dirty_init_fn, 0);
d->sectors_dirty_last = bcache_dev_sectors_dirty(d);
}
void bch_cached_dev_writeback_init(struct cached_dev *dc)
@ -507,10 +531,11 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
dc->writeback_percent = 10;
dc->writeback_delay = 30;
dc->writeback_rate.rate = 1024;
dc->writeback_rate_minimum = 8;
dc->writeback_rate_update_seconds = 5;
dc->writeback_rate_d_term = 30;
dc->writeback_rate_p_term_inverse = 6000;
dc->writeback_rate_p_term_inverse = 40;
dc->writeback_rate_i_term_inverse = 10000;
INIT_DELAYED_WORK(&dc->writeback_rate_update, update_writeback_rate);
}
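
The PI controller described in the __update_writeback_rate() comment above can be restated as a small userspace sketch. struct pi_ctl and pi_update() are hypothetical names, and the keeping_up flag stands in for the time_before64() test against writeback_rate.next. Unlike the kernel code, which clamps through clamp_t(int32_t, ...), this sketch clamps in 64-bit arithmetic, so it illustrates the control law rather than reproducing the patch exactly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the tunables and state named in the patch. */
struct pi_ctl {
	int64_t integral;        /* accumulated error */
	uint32_t p_term_inverse; /* patch default: 40 */
	uint32_t i_term_inverse; /* patch default: 10000 */
	uint32_t update_seconds; /* patch default: 5 */
	uint32_t rate_minimum;   /* patch default: 8 */
	uint32_t rate_maximum;   /* the patch clamps at NSEC_PER_SEC */
};

/* One controller step: returns the new rate in sectors per second. */
static uint32_t pi_update(struct pi_ctl *c, int64_t dirty, int64_t target,
			  bool keeping_up)
{
	int64_t error = dirty - target;
	int64_t proportional, integral_scaled, rate;

	assert(c->p_term_inverse && c->i_term_inverse);
	proportional = error / (int64_t)c->p_term_inverse;

	/*
	 * Only unwind a positive integral, and only wind it up while the
	 * device is keeping up, to avoid ineffective windup.
	 */
	if ((error < 0 && c->integral > 0) || (error > 0 && keeping_up))
		c->integral += error * (int64_t)c->update_seconds;

	integral_scaled = c->integral / (int64_t)c->i_term_inverse;

	rate = proportional + integral_scaled;
	if (rate < (int64_t)c->rate_minimum)
		rate = c->rate_minimum;
	if (rate > (int64_t)c->rate_maximum)
		rate = c->rate_maximum;
	return (uint32_t)rate;
}
```

With the defaults from the patch, an error of 4000 sectors gives a proportional term of 100 and adds 20000 to the integral, for a rate of 102 sectors per second. Once dirty data drops below the target, the rate falls to the configured minimum while the integral unwinds.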


@ -77,7 +77,9 @@ static inline bool should_writeback(struct cached_dev *dc, struct bio *bio,
if (would_skip)
return false;
return op_is_sync(bio->bi_opf) || in_use <= CUTOFF_WRITEBACK;
return (op_is_sync(bio->bi_opf) ||
bio->bi_opf & (REQ_META|REQ_PRIO) ||
in_use <= CUTOFF_WRITEBACK);
}
static inline void bch_writeback_queue(struct cached_dev *dc)
@ -90,7 +92,7 @@ static inline void bch_writeback_add(struct cached_dev *dc)
{
if (!atomic_read(&dc->has_dirty) &&
!atomic_xchg(&dc->has_dirty, 1)) {
atomic_inc(&dc->count);
refcount_inc(&dc->count);
if (BDEV_STATE(&dc->sb) != BDEV_STATE_DIRTY) {
SET_BDEV_STATE(&dc->sb, BDEV_STATE_DIRTY);
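
The request-flag tests this series adds to check_should_bypass() and should_writeback() reduce to a few bit operations. The sketch below restates them in userspace. The REQ_* bit values are placeholders chosen for the example, not the kernel's definitions; is_sync stands in for op_is_sync(), and the function names are hypothetical. The other conditions in both kernel functions (detaching state, would_skip, alignment, and so on) are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values for illustration only; the kernel defines its own. */
enum {
	REQ_META       = 1u << 0,
	REQ_PRIO       = 1u << 1,
	REQ_RAHEAD     = 1u << 2,
	REQ_BACKGROUND = 1u << 3,
};

/* Bypass read-ahead or background I/O unless it is a metadata request. */
static bool bypass_for_flags(unsigned int opf)
{
	return (opf & (REQ_RAHEAD | REQ_BACKGROUND)) && !(opf & REQ_META);
}

/* Prefer writeback for sync, metadata, or priority I/O, or a lightly used cache. */
static bool writeback_for_flags(unsigned int opf, bool is_sync,
				unsigned int in_use, unsigned int cutoff)
{
	return is_sync || (opf & (REQ_META | REQ_PRIO)) || in_use <= cutoff;
}
```

Metadata read-ahead therefore still goes through the cache, and metadata or priority writes are cached in writeback mode even when cache usage is above the cutoff.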


@ -368,7 +368,7 @@ static int read_page(struct file *file, unsigned long index,
pr_debug("read bitmap file (%dB @ %llu)\n", (int)PAGE_SIZE,
(unsigned long long)index << PAGE_SHIFT);
bh = alloc_page_buffers(page, 1<<inode->i_blkbits, 0);
bh = alloc_page_buffers(page, 1<<inode->i_blkbits, false);
if (!bh) {
ret = -ENOMEM;
goto out;


@ -56,7 +56,7 @@ static unsigned dm_get_blk_mq_queue_depth(void)
int dm_request_based(struct mapped_device *md)
{
return blk_queue_stackable(md->queue);
return queue_is_rq_based(md->queue);
}
static void dm_old_start_queue(struct request_queue *q)


@ -1000,7 +1000,7 @@ verify_rq_based:
list_for_each_entry(dd, devices, list) {
struct request_queue *q = bdev_get_queue(dd->dm_dev->bdev);
if (!blk_queue_stackable(q)) {
if (!queue_is_rq_based(q)) {
DMERR("table load rejected: including"
" non-request-stackable devices");
return -EINVAL;
@ -1847,19 +1847,6 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
*/
if (blk_queue_add_random(q) && dm_table_all_devices_attribute(t, device_is_not_random))
queue_flag_clear_unlocked(QUEUE_FLAG_ADD_RANDOM, q);
/*
* QUEUE_FLAG_STACKABLE must be set after all queue settings are
* visible to other CPUs because, once the flag is set, incoming bios
* are processed by request-based dm, which refers to the queue
* settings.
* Until the flag set, bios are passed to bio-based dm and queued to
* md->deferred where queue settings are not needed yet.
* Those bios are passed to request-based dm at the resume time.
*/
smp_mb();
if (dm_table_request_based(t))
queue_flag_set_unlocked(QUEUE_FLAG_STACKABLE, q);
}
unsigned int dm_table_get_num_targets(struct dm_table *t)


@ -1618,17 +1618,6 @@ static void dm_wq_work(struct work_struct *work);
void dm_init_md_queue(struct mapped_device *md)
{
/*
* Request-based dm devices cannot be stacked on top of bio-based dm
* devices. The type of this dm device may not have been decided yet.
* The type is decided at the first table loading time.
* To prevent problematic device stacking, clear the queue flag
* for request stacking support until then.
*
* This queue is new, so no concurrency on the queue_flags.
*/
queue_flag_clear_unlocked(QUEUE_FLAG_STACKABLE, md->queue);
/*
* Initialize data that will only be used by a non-blk-mq DM queue
* - must do so here (in alloc_dev callchain) before queue is used


@ -1,2 +1,6 @@
menu "NVME Support"
source "drivers/nvme/host/Kconfig"
source "drivers/nvme/target/Kconfig"
endmenu


@ -13,6 +13,15 @@ config BLK_DEV_NVME
To compile this driver as a module, choose M here: the
module will be called nvme.
config NVME_MULTIPATH
bool "NVMe multipath support"
depends on NVME_CORE
---help---
This option enables support for multipath access to NVMe
subsystems. If this option is enabled only a single
/dev/nvmeXnY device will show up for each NVMe namespaces,
even if it is accessible through multiple controllers.
config NVME_FABRICS
tristate


@ -6,6 +6,7 @@ obj-$(CONFIG_NVME_RDMA) += nvme-rdma.o
obj-$(CONFIG_NVME_FC) += nvme-fc.o
nvme-core-y := core.o
nvme-core-$(CONFIG_NVME_MULTIPATH) += multipath.o
nvme-core-$(CONFIG_NVM) += lightnvm.o
nvme-y += pci.o

File diff suppressed because it is too large


@ -548,6 +548,7 @@ static const match_table_t opt_tokens = {
{ NVMF_OPT_HOSTNQN, "hostnqn=%s" },
{ NVMF_OPT_HOST_TRADDR, "host_traddr=%s" },
{ NVMF_OPT_HOST_ID, "hostid=%s" },
{ NVMF_OPT_DUP_CONNECT, "duplicate_connect" },
{ NVMF_OPT_ERR, NULL }
};
@ -566,6 +567,7 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
opts->nr_io_queues = num_online_cpus();
opts->reconnect_delay = NVMF_DEF_RECONNECT_DELAY;
opts->kato = NVME_DEFAULT_KATO;
opts->duplicate_connect = false;
options = o = kstrdup(buf, GFP_KERNEL);
if (!options)
@ -742,6 +744,9 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
goto out;
}
break;
case NVMF_OPT_DUP_CONNECT:
opts->duplicate_connect = true;
break;
default:
pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n",
p);
@ -823,7 +828,7 @@ EXPORT_SYMBOL_GPL(nvmf_free_options);
#define NVMF_REQUIRED_OPTS (NVMF_OPT_TRANSPORT | NVMF_OPT_NQN)
#define NVMF_ALLOWED_OPTS (NVMF_OPT_QUEUE_SIZE | NVMF_OPT_NR_IO_QUEUES | \
NVMF_OPT_KATO | NVMF_OPT_HOSTNQN | \
NVMF_OPT_HOST_ID)
NVMF_OPT_HOST_ID | NVMF_OPT_DUP_CONNECT)
static struct nvme_ctrl *
nvmf_create_ctrl(struct device *dev, const char *buf, size_t count)
@ -841,6 +846,9 @@ nvmf_create_ctrl(struct device *dev, const char *buf, size_t count)
if (ret)
goto out_free_opts;
request_module("nvme-%s", opts->transport);
/*
* Check the generic options first as we need a valid transport for
* the lookup below. Then clear the generic flags so that transport
@ -874,12 +882,12 @@ nvmf_create_ctrl(struct device *dev, const char *buf, size_t count)
goto out_unlock;
}
if (strcmp(ctrl->subnqn, opts->subsysnqn)) {
if (strcmp(ctrl->subsys->subnqn, opts->subsysnqn)) {
dev_warn(ctrl->device,
"controller returned incorrect NQN: \"%s\".\n",
ctrl->subnqn);
ctrl->subsys->subnqn);
up_read(&nvmf_transports_rwsem);
ctrl->ops->delete_ctrl(ctrl);
nvme_delete_ctrl_sync(ctrl);
return ERR_PTR(-EINVAL);
}


@ -57,6 +57,7 @@ enum {
NVMF_OPT_HOST_TRADDR = 1 << 10,
NVMF_OPT_CTRL_LOSS_TMO = 1 << 11,
NVMF_OPT_HOST_ID = 1 << 12,
NVMF_OPT_DUP_CONNECT = 1 << 13,
};
/**
@ -96,6 +97,7 @@ struct nvmf_ctrl_options {
unsigned int nr_io_queues;
unsigned int reconnect_delay;
bool discovery_nqn;
bool duplicate_connect;
unsigned int kato;
struct nvmf_host *host;
int max_reconnects;
@ -131,6 +133,18 @@ struct nvmf_transport_ops {
struct nvmf_ctrl_options *opts);
};
static inline bool
nvmf_ctlr_matches_baseopts(struct nvme_ctrl *ctrl,
struct nvmf_ctrl_options *opts)
{
if (strcmp(opts->subsysnqn, ctrl->opts->subsysnqn) ||
strcmp(opts->host->nqn, ctrl->opts->host->nqn) ||
memcmp(&opts->host->id, &ctrl->opts->host->id, sizeof(uuid_t)))
return false;
return true;
}
int nvmf_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val);
int nvmf_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val);
int nvmf_reg_write32(struct nvme_ctrl *ctrl, u32 off, u32 val);

File diff suppressed because it is too large


@ -305,7 +305,7 @@ static int nvme_nvm_identity(struct nvm_dev *nvmdev, struct nvm_id *nvm_id)
int ret;
c.identity.opcode = nvme_nvm_admin_identity;
c.identity.nsid = cpu_to_le32(ns->ns_id);
c.identity.nsid = cpu_to_le32(ns->head->ns_id);
c.identity.chnl_off = 0;
nvme_nvm_id = kmalloc(sizeof(struct nvme_nvm_id), GFP_KERNEL);
@ -344,7 +344,7 @@ static int nvme_nvm_get_l2p_tbl(struct nvm_dev *nvmdev, u64 slba, u32 nlb,
int ret = 0;
c.l2p.opcode = nvme_nvm_admin_get_l2p_tbl;
c.l2p.nsid = cpu_to_le32(ns->ns_id);
c.l2p.nsid = cpu_to_le32(ns->head->ns_id);
entries = kmalloc(len, GFP_KERNEL);
if (!entries)
return -ENOMEM;
@ -402,7 +402,7 @@ static int nvme_nvm_get_bb_tbl(struct nvm_dev *nvmdev, struct ppa_addr ppa,
int ret = 0;
c.get_bb.opcode = nvme_nvm_admin_get_bb_tbl;
c.get_bb.nsid = cpu_to_le32(ns->ns_id);
c.get_bb.nsid = cpu_to_le32(ns->head->ns_id);
c.get_bb.spba = cpu_to_le64(ppa.ppa);
bb_tbl = kzalloc(tblsz, GFP_KERNEL);
@ -452,7 +452,7 @@ static int nvme_nvm_set_bb_tbl(struct nvm_dev *nvmdev, struct ppa_addr *ppas,
int ret = 0;
c.set_bb.opcode = nvme_nvm_admin_set_bb_tbl;
c.set_bb.nsid = cpu_to_le32(ns->ns_id);
c.set_bb.nsid = cpu_to_le32(ns->head->ns_id);
c.set_bb.spba = cpu_to_le64(ppas->ppa);
c.set_bb.nlb = cpu_to_le16(nr_ppas - 1);
c.set_bb.value = type;
@ -469,7 +469,7 @@ static inline void nvme_nvm_rqtocmd(struct nvm_rq *rqd, struct nvme_ns *ns,
struct nvme_nvm_command *c)
{
c->ph_rw.opcode = rqd->opcode;
c->ph_rw.nsid = cpu_to_le32(ns->ns_id);
c->ph_rw.nsid = cpu_to_le32(ns->head->ns_id);
c->ph_rw.spba = cpu_to_le64(rqd->ppa_addr.ppa);
c->ph_rw.metadata = cpu_to_le64(rqd->dma_meta_list);
c->ph_rw.control = cpu_to_le16(rqd->flags);
@ -492,33 +492,46 @@ static void nvme_nvm_end_io(struct request *rq, blk_status_t status)
blk_mq_free_request(rq);
}
static struct request *nvme_nvm_alloc_request(struct request_queue *q,
struct nvm_rq *rqd,
struct nvme_nvm_command *cmd)
{
struct nvme_ns *ns = q->queuedata;
struct request *rq;
nvme_nvm_rqtocmd(rqd, ns, cmd);
rq = nvme_alloc_request(q, (struct nvme_command *)cmd, 0, NVME_QID_ANY);
if (IS_ERR(rq))
return rq;
rq->cmd_flags &= ~REQ_FAILFAST_DRIVER;
if (rqd->bio) {
blk_init_request_from_bio(rq, rqd->bio);
} else {
rq->ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_NORM);
rq->__data_len = 0;
}
return rq;
}
static int nvme_nvm_submit_io(struct nvm_dev *dev, struct nvm_rq *rqd)
{
struct request_queue *q = dev->q;
struct nvme_ns *ns = q->queuedata;
struct request *rq;
struct bio *bio = rqd->bio;
struct nvme_nvm_command *cmd;
struct request *rq;
cmd = kzalloc(sizeof(struct nvme_nvm_command), GFP_KERNEL);
if (!cmd)
return -ENOMEM;
nvme_nvm_rqtocmd(rqd, ns, cmd);
rq = nvme_alloc_request(q, (struct nvme_command *)cmd, 0, NVME_QID_ANY);
rq = nvme_nvm_alloc_request(q, rqd, cmd);
if (IS_ERR(rq)) {
kfree(cmd);
return PTR_ERR(rq);
}
rq->cmd_flags &= ~REQ_FAILFAST_DRIVER;
if (bio) {
blk_init_request_from_bio(rq, bio);
} else {
rq->ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_NORM);
rq->__data_len = 0;
}
rq->end_io_data = rqd;
@ -527,6 +540,34 @@ static int nvme_nvm_submit_io(struct nvm_dev *dev, struct nvm_rq *rqd)
return 0;
}
static int nvme_nvm_submit_io_sync(struct nvm_dev *dev, struct nvm_rq *rqd)
{
struct request_queue *q = dev->q;
struct request *rq;
struct nvme_nvm_command cmd;
int ret = 0;
memset(&cmd, 0, sizeof(struct nvme_nvm_command));
rq = nvme_nvm_alloc_request(q, rqd, &cmd);
if (IS_ERR(rq))
return PTR_ERR(rq);
/* I/Os can fail and the error is signaled through rqd. Callers must
* handle the error accordingly.
*/
blk_execute_rq(q, NULL, rq, 0);
if (nvme_req(rq)->flags & NVME_REQ_CANCELLED)
ret = -EINTR;
rqd->ppa_status = le64_to_cpu(nvme_req(rq)->result.u64);
rqd->error = nvme_req(rq)->status;
blk_mq_free_request(rq);
return ret;
}
static void *nvme_nvm_create_dma_pool(struct nvm_dev *nvmdev, char *name)
{
struct nvme_ns *ns = nvmdev->q->queuedata;
@ -562,6 +603,7 @@ static struct nvm_dev_ops nvme_nvm_dev_ops = {
.set_bb_tbl = nvme_nvm_set_bb_tbl,
.submit_io = nvme_nvm_submit_io,
.submit_io_sync = nvme_nvm_submit_io_sync,
.create_dma_pool = nvme_nvm_create_dma_pool,
.destroy_dma_pool = nvme_nvm_destroy_dma_pool,
@ -600,8 +642,6 @@ static int nvme_nvm_submit_user_cmd(struct request_queue *q,
rq->timeout = timeout ? timeout : ADMIN_TIMEOUT;
rq->cmd_flags &= ~REQ_FAILFAST_DRIVER;
if (ppa_buf && ppa_len) {
ppa_list = dma_pool_alloc(dev->dma_pool, GFP_KERNEL, &ppa_dma);
if (!ppa_list) {
@ -691,7 +731,7 @@ static int nvme_nvm_submit_vio(struct nvme_ns *ns,
memset(&c, 0, sizeof(c));
c.ph_rw.opcode = vio.opcode;
c.ph_rw.nsid = cpu_to_le32(ns->ns_id);
c.ph_rw.nsid = cpu_to_le32(ns->head->ns_id);
c.ph_rw.control = cpu_to_le16(vio.control);
c.ph_rw.length = cpu_to_le16(vio.nppas);
@ -728,7 +768,7 @@ static int nvme_nvm_user_vcmd(struct nvme_ns *ns, int admin,
memset(&c, 0, sizeof(c));
c.common.opcode = vcmd.opcode;
c.common.nsid = cpu_to_le32(ns->ns_id);
c.common.nsid = cpu_to_le32(ns->head->ns_id);
c.common.cdw2[0] = cpu_to_le32(vcmd.cdw2);
c.common.cdw2[1] = cpu_to_le32(vcmd.cdw3);
/* cdw11-12 */


@ -0,0 +1,291 @@
/*
* Copyright (c) 2017 Christoph Hellwig.
*
* This program is free software; you can redistribute it and/or modify it
* under the terms and conditions of the GNU General Public License,
* version 2, as published by the Free Software Foundation.
*
* This program is distributed in the hope it will be useful, but WITHOUT
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
* FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
* more details.
*/
#include <linux/moduleparam.h>
#include "nvme.h"
static bool multipath = true;
module_param(multipath, bool, 0644);
MODULE_PARM_DESC(multipath,
"turn on native support for multiple controllers per subsystem");
void nvme_failover_req(struct request *req)
{
struct nvme_ns *ns = req->q->queuedata;
unsigned long flags;
spin_lock_irqsave(&ns->head->requeue_lock, flags);
blk_steal_bios(&ns->head->requeue_list, req);
spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
blk_mq_end_request(req, 0);
nvme_reset_ctrl(ns->ctrl);
kblockd_schedule_work(&ns->head->requeue_work);
}
bool nvme_req_needs_failover(struct request *req)
{
if (!(req->cmd_flags & REQ_NVME_MPATH))
return false;
switch (nvme_req(req)->status & 0x7ff) {
/*
* Generic command status:
*/
case NVME_SC_INVALID_OPCODE:
case NVME_SC_INVALID_FIELD:
case NVME_SC_INVALID_NS:
case NVME_SC_LBA_RANGE:
case NVME_SC_CAP_EXCEEDED:
case NVME_SC_RESERVATION_CONFLICT:
return false;
/*
* I/O command set specific error. Unfortunately these values are
* reused for fabrics commands, but those should never get here.
*/
case NVME_SC_BAD_ATTRIBUTES:
case NVME_SC_INVALID_PI:
case NVME_SC_READ_ONLY:
case NVME_SC_ONCS_NOT_SUPPORTED:
WARN_ON_ONCE(nvme_req(req)->cmd->common.opcode ==
nvme_fabrics_command);
return false;
/*
* Media and Data Integrity Errors:
*/
case NVME_SC_WRITE_FAULT:
case NVME_SC_READ_ERROR:
case NVME_SC_GUARD_CHECK:
case NVME_SC_APPTAG_CHECK:
case NVME_SC_REFTAG_CHECK:
case NVME_SC_COMPARE_FAILED:
case NVME_SC_ACCESS_DENIED:
case NVME_SC_UNWRITTEN_BLOCK:
return false;
}
/* Everything else could be a path failure, so should be retried */
return true;
}
void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl)
{
struct nvme_ns *ns;
mutex_lock(&ctrl->namespaces_mutex);
list_for_each_entry(ns, &ctrl->namespaces, list) {
if (ns->head->disk)
kblockd_schedule_work(&ns->head->requeue_work);
}
mutex_unlock(&ctrl->namespaces_mutex);
}
static struct nvme_ns *__nvme_find_path(struct nvme_ns_head *head)
{
struct nvme_ns *ns;
list_for_each_entry_rcu(ns, &head->list, siblings) {
if (ns->ctrl->state == NVME_CTRL_LIVE) {
rcu_assign_pointer(head->current_path, ns);
return ns;
}
}
return NULL;
}
inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
{
struct nvme_ns *ns = srcu_dereference(head->current_path, &head->srcu);
if (unlikely(!ns || ns->ctrl->state != NVME_CTRL_LIVE))
ns = __nvme_find_path(head);
return ns;
}
static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
struct bio *bio)
{
struct nvme_ns_head *head = q->queuedata;
struct device *dev = disk_to_dev(head->disk);
struct nvme_ns *ns;
blk_qc_t ret = BLK_QC_T_NONE;
int srcu_idx;
srcu_idx = srcu_read_lock(&head->srcu);
ns = nvme_find_path(head);
if (likely(ns)) {
bio->bi_disk = ns->disk;
bio->bi_opf |= REQ_NVME_MPATH;
ret = direct_make_request(bio);
} else if (!list_empty_careful(&head->list)) {
dev_warn_ratelimited(dev, "no path available - requeing I/O\n");
spin_lock_irq(&head->requeue_lock);
bio_list_add(&head->requeue_list, bio);
spin_unlock_irq(&head->requeue_lock);
} else {
dev_warn_ratelimited(dev, "no path - failing I/O\n");
bio->bi_status = BLK_STS_IOERR;
bio_endio(bio);
}
srcu_read_unlock(&head->srcu, srcu_idx);
return ret;
}
static bool nvme_ns_head_poll(struct request_queue *q, blk_qc_t qc)
{
struct nvme_ns_head *head = q->queuedata;
struct nvme_ns *ns;
bool found = false;
int srcu_idx;
srcu_idx = srcu_read_lock(&head->srcu);
ns = srcu_dereference(head->current_path, &head->srcu);
if (likely(ns && ns->ctrl->state == NVME_CTRL_LIVE))
found = ns->queue->poll_fn(q, qc);
srcu_read_unlock(&head->srcu, srcu_idx);
return found;
}
static void nvme_requeue_work(struct work_struct *work)
{
struct nvme_ns_head *head =
container_of(work, struct nvme_ns_head, requeue_work);
struct bio *bio, *next;
spin_lock_irq(&head->requeue_lock);
next = bio_list_get(&head->requeue_list);
spin_unlock_irq(&head->requeue_lock);
while ((bio = next) != NULL) {
next = bio->bi_next;
bio->bi_next = NULL;
/*
* Reset disk to the mpath node and resubmit to select a new
* path.
*/
bio->bi_disk = head->disk;
generic_make_request(bio);
}
}
int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
{
struct request_queue *q;
bool vwc = false;
bio_list_init(&head->requeue_list);
spin_lock_init(&head->requeue_lock);
INIT_WORK(&head->requeue_work, nvme_requeue_work);
/*
* Add a multipath node if the subsystem supports multiple controllers.
* We also do this for private namespaces as the namespace sharing data could
* change after a rescan.
*/
if (!(ctrl->subsys->cmic & (1 << 1)) || !multipath)
return 0;
q = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE);
if (!q)
goto out;
q->queuedata = head;
blk_queue_make_request(q, nvme_ns_head_make_request);
q->poll_fn = nvme_ns_head_poll;
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
/* set to a default value of 512 until the disk is validated */
blk_queue_logical_block_size(q, 512);
/* we need to propagate up the VWC settings */
if (ctrl->vwc & NVME_CTRL_VWC_PRESENT)
vwc = true;
blk_queue_write_cache(q, vwc, vwc);
head->disk = alloc_disk(0);
if (!head->disk)
goto out_cleanup_queue;
head->disk->fops = &nvme_ns_head_ops;
head->disk->private_data = head;
head->disk->queue = q;
head->disk->flags = GENHD_FL_EXT_DEVT;
sprintf(head->disk->disk_name, "nvme%dn%d",
ctrl->subsys->instance, head->instance);
return 0;
out_cleanup_queue:
blk_cleanup_queue(q);
out:
return -ENOMEM;
}
void nvme_mpath_add_disk(struct nvme_ns_head *head)
{
if (!head->disk)
return;
device_add_disk(&head->subsys->dev, head->disk);
if (sysfs_create_group(&disk_to_dev(head->disk)->kobj,
&nvme_ns_id_attr_group))
pr_warn("%s: failed to create sysfs group for identification\n",
head->disk->disk_name);
}
void nvme_mpath_add_disk_links(struct nvme_ns *ns)
{
struct kobject *slave_disk_kobj, *holder_disk_kobj;
if (!ns->head->disk)
return;
slave_disk_kobj = &disk_to_dev(ns->disk)->kobj;
if (sysfs_create_link(ns->head->disk->slave_dir, slave_disk_kobj,
kobject_name(slave_disk_kobj)))
return;
holder_disk_kobj = &disk_to_dev(ns->head->disk)->kobj;
if (sysfs_create_link(ns->disk->part0.holder_dir, holder_disk_kobj,
kobject_name(holder_disk_kobj)))
sysfs_remove_link(ns->head->disk->slave_dir,
kobject_name(slave_disk_kobj));
}
void nvme_mpath_remove_disk(struct nvme_ns_head *head)
{
if (!head->disk)
return;
sysfs_remove_group(&disk_to_dev(head->disk)->kobj,
&nvme_ns_id_attr_group);
del_gendisk(head->disk);
blk_set_queue_dying(head->disk->queue);
/* make sure all pending bios are cleaned up */
kblockd_schedule_work(&head->requeue_work);
flush_work(&head->requeue_work);
blk_cleanup_queue(head->disk->queue);
put_disk(head->disk);
}
void nvme_mpath_remove_disk_links(struct nvme_ns *ns)
{
if (!ns->head->disk)
return;
sysfs_remove_link(ns->disk->part0.holder_dir,
kobject_name(&disk_to_dev(ns->head->disk)->kobj));
sysfs_remove_link(ns->head->disk->slave_dir,
kobject_name(&disk_to_dev(ns->disk)->kobj));
}
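
Note on nvme_req_needs_failover() in the new file above: the decision is a classification of the completion status. Requests that did not arrive through the multipath node never fail over. The status is masked with 0x7ff, which keeps the low 11 bits, and a fixed list of statuses is treated as a genuine command or media error. Anything else is assumed to be a possible path failure. Below is a reduced sketch of that logic with only a handful of status codes. The SK_* names are hypothetical, and the numeric values are recalled from include/linux/nvme.h, so treat them as assumptions to verify against the header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed status values; check them against include/linux/nvme.h. */
enum {
	SK_SC_INVALID_OPCODE = 0x001,
	SK_SC_INVALID_FIELD  = 0x002,
	SK_SC_LBA_RANGE      = 0x080,
	SK_SC_WRITE_FAULT    = 0x280,
	SK_SC_READ_ERROR     = 0x281,
};

/*
 * Returns true when a failed request should be retried on another path.
 * came_via_mpath models the REQ_NVME_MPATH flag on the request.
 */
static bool sketch_needs_failover(bool came_via_mpath, uint16_t status)
{
	if (!came_via_mpath)
		return false;

	/* Keep only the low 11 bits, as the mask in the diff does. */
	switch (status & 0x7ff) {
	case SK_SC_INVALID_OPCODE:
	case SK_SC_INVALID_FIELD:
	case SK_SC_LBA_RANGE:
	case SK_SC_WRITE_FAULT:
	case SK_SC_READ_ERROR:
		/* The command or the media is at fault; another path will not help. */
		return false;
	}

	/* Everything else could be a path failure. */
	return true;
}
```

The design is a deny-list: new or unknown statuses default to a retry on another path, which favours availability over failing the I/O early.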


@ -15,16 +15,17 @@
#define _NVME_H
#include <linux/nvme.h>
#include <linux/cdev.h>
#include <linux/pci.h>
#include <linux/kref.h>
#include <linux/blk-mq.h>
#include <linux/lightnvm.h>
#include <linux/sed-opal.h>
extern unsigned char nvme_io_timeout;
extern unsigned int nvme_io_timeout;
#define NVME_IO_TIMEOUT (nvme_io_timeout * HZ)
extern unsigned char admin_timeout;
extern unsigned int admin_timeout;
#define ADMIN_TIMEOUT (admin_timeout * HZ)
#define NVME_DEFAULT_KATO 5
@ -94,6 +95,11 @@ struct nvme_request {
u16 status;
};
/*
* Mark a bio as coming in through the mpath node.
*/
#define REQ_NVME_MPATH REQ_DRV
enum {
NVME_REQ_CANCELLED = (1 << 0),
};
@ -127,24 +133,23 @@ struct nvme_ctrl {
struct request_queue *admin_q;
struct request_queue *connect_q;
struct device *dev;
struct kref kref;
int instance;
struct blk_mq_tag_set *tagset;
struct blk_mq_tag_set *admin_tagset;
struct list_head namespaces;
struct mutex namespaces_mutex;
struct device ctrl_device;
struct device *device; /* char device */
struct list_head node;
struct ida ns_ida;
struct cdev cdev;
struct work_struct reset_work;
struct work_struct delete_work;
struct nvme_subsystem *subsys;
struct list_head subsys_entry;
struct opal_dev *opal_dev;
char name[12];
char serial[20];
char model[40];
char firmware_rev[8];
char subnqn[NVMF_NQN_SIZE];
u16 cntlid;
u32 ctrl_config;
@ -155,23 +160,23 @@ struct nvme_ctrl {
u32 page_size;
u32 max_hw_sectors;
u16 oncs;
u16 vid;
u16 oacs;
u16 nssa;
u16 nr_streams;
atomic_t abort_limit;
u8 event_limit;
u8 vwc;
u32 vs;
u32 sgls;
u16 kas;
u8 npss;
u8 apsta;
u32 aen_result;
unsigned int shutdown_timeout;
unsigned int kato;
bool subsystem;
unsigned long quirks;
struct nvme_id_power_state psd[32];
struct nvme_effects_log *effects;
struct work_struct scan_work;
struct work_struct async_event_work;
struct delayed_work ka_work;
@ -197,21 +202,72 @@ struct nvme_ctrl {
struct nvmf_ctrl_options *opts;
};
struct nvme_subsystem {
int instance;
struct device dev;
/*
* Because we unregister the device on the last put we need
* a separate refcount.
*/
struct kref ref;
struct list_head entry;
struct mutex lock;
struct list_head ctrls;
struct list_head nsheads;
char subnqn[NVMF_NQN_SIZE];
char serial[20];
char model[40];
char firmware_rev[8];
u8 cmic;
u16 vendor_id;
struct ida ns_ida;
};
/*
* Container structure for unique namespace identifiers.
*/
struct nvme_ns_ids {
u8 eui64[8];
u8 nguid[16];
uuid_t uuid;
};
/*
* Anchor structure for namespaces. There is one for each namespace in an
* NVMe subsystem that any of our controllers can see, and the namespace
* structure for each controller is chained off it. For private namespaces
* there is a 1:1 relation to our namespace structures, that is ->list
* only ever has a single entry for private namespaces.
*/
struct nvme_ns_head {
#ifdef CONFIG_NVME_MULTIPATH
struct gendisk *disk;
struct nvme_ns __rcu *current_path;
struct bio_list requeue_list;
spinlock_t requeue_lock;
struct work_struct requeue_work;
#endif
struct list_head list;
struct srcu_struct srcu;
struct nvme_subsystem *subsys;
unsigned ns_id;
struct nvme_ns_ids ids;
struct list_head entry;
struct kref ref;
int instance;
};
struct nvme_ns {
struct list_head list;
struct nvme_ctrl *ctrl;
struct request_queue *queue;
struct gendisk *disk;
struct list_head siblings;
struct nvm_dev *ndev;
struct kref kref;
int instance;
struct nvme_ns_head *head;
u8 eui[8];
u8 nguid[16];
uuid_t uuid;
unsigned ns_id;
int lba_shift;
u16 ms;
u16 sgs;
@ -234,9 +290,10 @@ struct nvme_ctrl_ops {
int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
void (*free_ctrl)(struct nvme_ctrl *ctrl);
void (*submit_async_event)(struct nvme_ctrl *ctrl, int aer_idx);
int (*delete_ctrl)(struct nvme_ctrl *ctrl);
void (*submit_async_event)(struct nvme_ctrl *ctrl);
void (*delete_ctrl)(struct nvme_ctrl *ctrl);
int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
int (*reinit_request)(void *data, struct request *rq);
};
static inline bool nvme_ctrl_ready(struct nvme_ctrl *ctrl)
@ -278,6 +335,16 @@ static inline void nvme_end_request(struct request *req, __le16 status,
blk_mq_complete_request(req);
}
static inline void nvme_get_ctrl(struct nvme_ctrl *ctrl)
{
get_device(ctrl->device);
}
static inline void nvme_put_ctrl(struct nvme_ctrl *ctrl)
{
put_device(ctrl->device);
}
void nvme_complete_rq(struct request *req);
void nvme_cancel_request(struct request *req, void *data, bool reserved);
bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
@ -299,10 +366,8 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl);
int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len,
bool send);
#define NVME_NR_AERS 1
void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
union nvme_result *res);
void nvme_queue_async_events(struct nvme_ctrl *ctrl);
void nvme_stop_queues(struct nvme_ctrl *ctrl);
void nvme_start_queues(struct nvme_ctrl *ctrl);
@ -311,21 +376,79 @@ void nvme_unfreeze(struct nvme_ctrl *ctrl);
void nvme_wait_freeze(struct nvme_ctrl *ctrl);
void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout);
void nvme_start_freeze(struct nvme_ctrl *ctrl);
int nvme_reinit_tagset(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set);
#define NVME_QID_ANY -1
struct request *nvme_alloc_request(struct request_queue *q,
struct nvme_command *cmd, unsigned int flags, int qid);
struct nvme_command *cmd, blk_mq_req_flags_t flags, int qid);
blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmd);
int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
void *buf, unsigned bufflen);
int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
union nvme_result *result, void *buffer, unsigned bufflen,
unsigned timeout, int qid, int at_head, int flags);
unsigned timeout, int qid, int at_head,
blk_mq_req_flags_t flags);
int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count);
void nvme_start_keep_alive(struct nvme_ctrl *ctrl);
void nvme_stop_keep_alive(struct nvme_ctrl *ctrl);
int nvme_reset_ctrl(struct nvme_ctrl *ctrl);
int nvme_delete_ctrl(struct nvme_ctrl *ctrl);
int nvme_delete_ctrl_sync(struct nvme_ctrl *ctrl);
extern const struct attribute_group nvme_ns_id_attr_group;
extern const struct block_device_operations nvme_ns_head_ops;
#ifdef CONFIG_NVME_MULTIPATH
void nvme_failover_req(struct request *req);
bool nvme_req_needs_failover(struct request *req);
void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl);
int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head);
void nvme_mpath_add_disk(struct nvme_ns_head *head);
void nvme_mpath_add_disk_links(struct nvme_ns *ns);
void nvme_mpath_remove_disk(struct nvme_ns_head *head);
void nvme_mpath_remove_disk_links(struct nvme_ns *ns);
static inline void nvme_mpath_clear_current_path(struct nvme_ns *ns)
{
struct nvme_ns_head *head = ns->head;
if (head && ns == srcu_dereference(head->current_path, &head->srcu))
rcu_assign_pointer(head->current_path, NULL);
}
struct nvme_ns *nvme_find_path(struct nvme_ns_head *head);
#else
static inline void nvme_failover_req(struct request *req)
{
}
static inline bool nvme_req_needs_failover(struct request *req)
{
return false;
}
static inline void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl)
{
}
static inline int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,
struct nvme_ns_head *head)
{
return 0;
}
static inline void nvme_mpath_add_disk(struct nvme_ns_head *head)
{
}
static inline void nvme_mpath_remove_disk(struct nvme_ns_head *head)
{
}
static inline void nvme_mpath_add_disk_links(struct nvme_ns *ns)
{
}
static inline void nvme_mpath_remove_disk_links(struct nvme_ns *ns)
{
}
static inline void nvme_mpath_clear_current_path(struct nvme_ns *ns)
{
}
#endif /* CONFIG_NVME_MULTIPATH */
#ifdef CONFIG_NVM
int nvme_nvm_register(struct nvme_ns *ns, char *disk_name, int node);


@ -13,7 +13,6 @@
*/
#include <linux/aer.h>
#include <linux/bitops.h>
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/blk-mq-pci.h>
@ -26,12 +25,9 @@
#include <linux/mutex.h>
#include <linux/once.h>
#include <linux/pci.h>
#include <linux/poison.h>
#include <linux/t10-pi.h>
#include <linux/timer.h>
#include <linux/types.h>
#include <linux/io-64-nonatomic-lo-hi.h>
#include <asm/unaligned.h>
#include <linux/sed-opal.h>
#include "nvme.h"
@ -39,11 +35,7 @@
#define SQ_SIZE(depth) (depth * sizeof(struct nvme_command))
#define CQ_SIZE(depth) (depth * sizeof(struct nvme_completion))
/*
* We handle AEN commands ourselves and don't even let the
* block layer know about them.
*/
#define NVME_AQ_BLKMQ_DEPTH (NVME_AQ_DEPTH - NVME_NR_AERS)
#define SGES_PER_PAGE (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
static int use_threaded_interrupts;
module_param(use_threaded_interrupts, int, 0);
@ -57,6 +49,12 @@ module_param(max_host_mem_size_mb, uint, 0444);
MODULE_PARM_DESC(max_host_mem_size_mb,
"Maximum Host Memory Buffer (HMB) size per controller (in MiB)");
static unsigned int sgl_threshold = SZ_32K;
module_param(sgl_threshold, uint, 0644);
MODULE_PARM_DESC(sgl_threshold,
"Use SGLs when average request segment size is larger or equal to "
"this size. Use 0 to disable SGLs.");
static int io_queue_depth_set(const char *val, const struct kernel_param *kp);
static const struct kernel_param_ops io_queue_depth_ops = {
.set = io_queue_depth_set,
@ -178,6 +176,7 @@ struct nvme_queue {
struct nvme_iod {
struct nvme_request req;
struct nvme_queue *nvmeq;
bool use_sgl;
int aborted;
int npages; /* In the PRP list. 0 means small pool in use */
int nents; /* Used in scatterlist */
@ -331,17 +330,35 @@ static int nvme_npages(unsigned size, struct nvme_dev *dev)
return DIV_ROUND_UP(8 * nprps, PAGE_SIZE - 8);
}
static unsigned int nvme_iod_alloc_size(struct nvme_dev *dev,
unsigned int size, unsigned int nseg)
/*
* Calculates the number of pages needed for the SGL segments. For example a 4k
* page can accommodate 256 SGL descriptors.
*/
static int nvme_pci_npages_sgl(unsigned int num_seg)
{
return sizeof(__le64 *) * nvme_npages(size, dev) +
sizeof(struct scatterlist) * nseg;
return DIV_ROUND_UP(num_seg * sizeof(struct nvme_sgl_desc), PAGE_SIZE);
}
static unsigned int nvme_cmd_size(struct nvme_dev *dev)
static unsigned int nvme_pci_iod_alloc_size(struct nvme_dev *dev,
unsigned int size, unsigned int nseg, bool use_sgl)
{
return sizeof(struct nvme_iod) +
nvme_iod_alloc_size(dev, NVME_INT_BYTES(dev), NVME_INT_PAGES);
size_t alloc_size;
if (use_sgl)
alloc_size = sizeof(__le64 *) * nvme_pci_npages_sgl(nseg);
else
alloc_size = sizeof(__le64 *) * nvme_npages(size, dev);
return alloc_size + sizeof(struct scatterlist) * nseg;
}
static unsigned int nvme_pci_cmd_size(struct nvme_dev *dev, bool use_sgl)
{
unsigned int alloc_size = nvme_pci_iod_alloc_size(dev,
NVME_INT_BYTES(dev), NVME_INT_PAGES,
use_sgl);
return sizeof(struct nvme_iod) + alloc_size;
}
static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
@ -425,10 +442,10 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
nvmeq->sq_tail = tail;
}
static __le64 **iod_list(struct request *req)
static void **nvme_pci_iod_list(struct request *req)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
return (__le64 **)(iod->sg + blk_rq_nr_phys_segments(req));
return (void **)(iod->sg + blk_rq_nr_phys_segments(req));
}
static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev)
@ -438,7 +455,10 @@ static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev)
unsigned int size = blk_rq_payload_bytes(rq);
if (nseg > NVME_INT_PAGES || size > NVME_INT_BYTES(dev)) {
iod->sg = kmalloc(nvme_iod_alloc_size(dev, size, nseg), GFP_ATOMIC);
size_t alloc_size = nvme_pci_iod_alloc_size(dev, size, nseg,
iod->use_sgl);
iod->sg = kmalloc(alloc_size, GFP_ATOMIC);
if (!iod->sg)
return BLK_STS_RESOURCE;
} else {
@ -456,18 +476,31 @@ static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev)
static void nvme_free_iod(struct nvme_dev *dev, struct request *req)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
const int last_prp = dev->ctrl.page_size / 8 - 1;
const int last_prp = dev->ctrl.page_size / sizeof(__le64) - 1;
dma_addr_t dma_addr = iod->first_dma, next_dma_addr;
int i;
__le64 **list = iod_list(req);
dma_addr_t prp_dma = iod->first_dma;
if (iod->npages == 0)
dma_pool_free(dev->prp_small_pool, list[0], prp_dma);
dma_pool_free(dev->prp_small_pool, nvme_pci_iod_list(req)[0],
dma_addr);
for (i = 0; i < iod->npages; i++) {
__le64 *prp_list = list[i];
dma_addr_t next_prp_dma = le64_to_cpu(prp_list[last_prp]);
dma_pool_free(dev->prp_page_pool, prp_list, prp_dma);
prp_dma = next_prp_dma;
void *addr = nvme_pci_iod_list(req)[i];
if (iod->use_sgl) {
struct nvme_sgl_desc *sg_list = addr;
next_dma_addr =
le64_to_cpu((sg_list[SGES_PER_PAGE - 1]).addr);
} else {
__le64 *prp_list = addr;
next_dma_addr = le64_to_cpu(prp_list[last_prp]);
}
dma_pool_free(dev->prp_page_pool, addr, dma_addr);
dma_addr = next_dma_addr;
}
if (iod->sg != iod->inline_sg)
@ -555,7 +588,8 @@ static void nvme_print_sgl(struct scatterlist *sgl, int nents)
}
}
static blk_status_t nvme_setup_prps(struct nvme_dev *dev, struct request *req)
static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
struct request *req, struct nvme_rw_command *cmnd)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct dma_pool *pool;
@ -566,14 +600,16 @@ static blk_status_t nvme_setup_prps(struct nvme_dev *dev, struct request *req)
u32 page_size = dev->ctrl.page_size;
int offset = dma_addr & (page_size - 1);
__le64 *prp_list;
__le64 **list = iod_list(req);
void **list = nvme_pci_iod_list(req);
dma_addr_t prp_dma;
int nprps, i;
iod->use_sgl = false;
length -= (page_size - offset);
if (length <= 0) {
iod->first_dma = 0;
return BLK_STS_OK;
goto done;
}
dma_len -= (page_size - offset);
@ -587,7 +623,7 @@ static blk_status_t nvme_setup_prps(struct nvme_dev *dev, struct request *req)
if (length <= page_size) {
iod->first_dma = dma_addr;
return BLK_STS_OK;
goto done;
}
nprps = DIV_ROUND_UP(length, page_size);
@ -634,6 +670,10 @@ static blk_status_t nvme_setup_prps(struct nvme_dev *dev, struct request *req)
dma_len = sg_dma_len(sg);
}
done:
cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
return BLK_STS_OK;
bad_sgl:
@ -643,6 +683,110 @@ static blk_status_t nvme_setup_prps(struct nvme_dev *dev, struct request *req)
return BLK_STS_IOERR;
}
static void nvme_pci_sgl_set_data(struct nvme_sgl_desc *sge,
struct scatterlist *sg)
{
sge->addr = cpu_to_le64(sg_dma_address(sg));
sge->length = cpu_to_le32(sg_dma_len(sg));
sge->type = NVME_SGL_FMT_DATA_DESC << 4;
}
static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc *sge,
dma_addr_t dma_addr, int entries)
{
sge->addr = cpu_to_le64(dma_addr);
if (entries < SGES_PER_PAGE) {
sge->length = cpu_to_le32(entries * sizeof(*sge));
sge->type = NVME_SGL_FMT_LAST_SEG_DESC << 4;
} else {
sge->length = cpu_to_le32(PAGE_SIZE);
sge->type = NVME_SGL_FMT_SEG_DESC << 4;
}
}
static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
struct request *req, struct nvme_rw_command *cmd)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
int length = blk_rq_payload_bytes(req);
struct dma_pool *pool;
struct nvme_sgl_desc *sg_list;
struct scatterlist *sg = iod->sg;
int entries = iod->nents, i = 0;
dma_addr_t sgl_dma;
iod->use_sgl = true;
/* setting the transfer type as SGL */
cmd->flags = NVME_CMD_SGL_METABUF;
if (length == sg_dma_len(sg)) {
nvme_pci_sgl_set_data(&cmd->dptr.sgl, sg);
return BLK_STS_OK;
}
if (entries <= (256 / sizeof(struct nvme_sgl_desc))) {
pool = dev->prp_small_pool;
iod->npages = 0;
} else {
pool = dev->prp_page_pool;
iod->npages = 1;
}
sg_list = dma_pool_alloc(pool, GFP_ATOMIC, &sgl_dma);
if (!sg_list) {
iod->npages = -1;
return BLK_STS_RESOURCE;
}
nvme_pci_iod_list(req)[0] = sg_list;
iod->first_dma = sgl_dma;
nvme_pci_sgl_set_seg(&cmd->dptr.sgl, sgl_dma, entries);
do {
if (i == SGES_PER_PAGE) {
struct nvme_sgl_desc *old_sg_desc = sg_list;
struct nvme_sgl_desc *link = &old_sg_desc[i - 1];
sg_list = dma_pool_alloc(pool, GFP_ATOMIC, &sgl_dma);
if (!sg_list)
return BLK_STS_RESOURCE;
i = 0;
nvme_pci_iod_list(req)[iod->npages++] = sg_list;
sg_list[i++] = *link;
nvme_pci_sgl_set_seg(link, sgl_dma, entries);
}
nvme_pci_sgl_set_data(&sg_list[i++], sg);
length -= sg_dma_len(sg);
sg = sg_next(sg);
entries--;
} while (length > 0);
WARN_ON(entries > 0);
return BLK_STS_OK;
}
static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
unsigned int avg_seg_size;
avg_seg_size = DIV_ROUND_UP(blk_rq_payload_bytes(req),
blk_rq_nr_phys_segments(req));
if (!(dev->ctrl.sgls & ((1 << 0) | (1 << 1))))
return false;
if (!iod->nvmeq->qid)
return false;
if (!sgl_threshold || avg_seg_size < sgl_threshold)
return false;
return true;
}
static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
struct nvme_command *cmnd)
{
@ -662,7 +806,11 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
DMA_ATTR_NO_WARN))
goto out;
ret = nvme_setup_prps(dev, req);
if (nvme_pci_use_sgls(dev, req))
ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
else
ret = nvme_pci_setup_prps(dev, req, &cmnd->rw);
if (ret != BLK_STS_OK)
goto out_unmap;
@ -682,8 +830,6 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
goto out_unmap;
}
cmnd->rw.dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
cmnd->rw.dptr.prp2 = cpu_to_le64(iod->first_dma);
if (blk_integrity_rq(req))
cmnd->rw.metadata = cpu_to_le64(sg_dma_address(&iod->meta_sg));
return BLK_STS_OK;
@ -804,7 +950,7 @@ static inline void nvme_handle_cqe(struct nvme_queue *nvmeq,
* for them but rather special case them here.
*/
if (unlikely(nvmeq->qid == 0 &&
cqe->command_id >= NVME_AQ_BLKMQ_DEPTH)) {
cqe->command_id >= NVME_AQ_BLK_MQ_DEPTH)) {
nvme_complete_async_event(&nvmeq->dev->ctrl,
cqe->status, &cqe->result);
return;
@ -897,7 +1043,7 @@ static int nvme_poll(struct blk_mq_hw_ctx *hctx, unsigned int tag)
return __nvme_poll(nvmeq, tag);
}
static void nvme_pci_submit_async_event(struct nvme_ctrl *ctrl, int aer_idx)
static void nvme_pci_submit_async_event(struct nvme_ctrl *ctrl)
{
struct nvme_dev *dev = to_nvme_dev(ctrl);
struct nvme_queue *nvmeq = dev->queues[0];
@ -905,7 +1051,7 @@ static void nvme_pci_submit_async_event(struct nvme_ctrl *ctrl, int aer_idx)
memset(&c, 0, sizeof(c));
c.common.opcode = nvme_admin_async_event;
c.common.command_id = NVME_AQ_BLKMQ_DEPTH + aer_idx;
c.common.command_id = NVME_AQ_BLK_MQ_DEPTH;
spin_lock_irq(&nvmeq->q_lock);
__nvme_submit_cmd(nvmeq, &c);
@ -930,7 +1076,7 @@ static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
int flags = NVME_QUEUE_PHYS_CONTIG | NVME_CQ_IRQ_ENABLED;
/*
* Note: we (ab)use the fact the the prp fields survive if no data
* Note: we (ab)use the fact that the prp fields survive if no data
* is attached to the request.
*/
memset(&c, 0, sizeof(c));
@ -951,7 +1097,7 @@ static int adapter_alloc_sq(struct nvme_dev *dev, u16 qid,
int flags = NVME_QUEUE_PHYS_CONTIG;
/*
* Note: we (ab)use the fact the the prp fields survive if no data
* Note: we (ab)use the fact that the prp fields survive if no data
* is attached to the request.
*/
memset(&c, 0, sizeof(c));
@ -1372,14 +1518,10 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
dev->admin_tagset.ops = &nvme_mq_admin_ops;
dev->admin_tagset.nr_hw_queues = 1;
/*
* Subtract one to leave an empty queue entry for 'Full Queue'
* condition. See NVM-Express 1.2 specification, section 4.1.2.
*/
dev->admin_tagset.queue_depth = NVME_AQ_BLKMQ_DEPTH - 1;
dev->admin_tagset.queue_depth = NVME_AQ_MQ_TAG_DEPTH;
dev->admin_tagset.timeout = ADMIN_TIMEOUT;
dev->admin_tagset.numa_node = dev_to_node(dev->dev);
dev->admin_tagset.cmd_size = nvme_cmd_size(dev);
dev->admin_tagset.cmd_size = nvme_pci_cmd_size(dev, false);
dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED;
dev->admin_tagset.driver_data = dev;
@ -1906,7 +2048,11 @@ static int nvme_dev_add(struct nvme_dev *dev)
dev->tagset.numa_node = dev_to_node(dev->dev);
dev->tagset.queue_depth =
min_t(int, dev->q_depth, BLK_MQ_MAX_DEPTH) - 1;
dev->tagset.cmd_size = nvme_cmd_size(dev);
dev->tagset.cmd_size = nvme_pci_cmd_size(dev, false);
if ((dev->ctrl.sgls & ((1 << 0) | (1 << 1))) && sgl_threshold) {
dev->tagset.cmd_size = max(dev->tagset.cmd_size,
nvme_pci_cmd_size(dev, true));
}
dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
dev->tagset.driver_data = dev;
@ -2132,9 +2278,9 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
{
dev_warn(dev->ctrl.device, "Removing after probe failure status: %d\n", status);
kref_get(&dev->ctrl.kref);
nvme_get_ctrl(&dev->ctrl);
nvme_dev_disable(dev, false);
if (!schedule_work(&dev->remove_work))
if (!queue_work(nvme_wq, &dev->remove_work))
nvme_put_ctrl(&dev->ctrl);
}
@ -2557,6 +2703,7 @@ static int __init nvme_init(void)
static void __exit nvme_exit(void)
{
pci_unregister_driver(&nvme_driver);
flush_workqueue(nvme_wq);
_nvme_check_size();
}
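
Note on the admin-queue depth constants used in the hunks above: the per-transport macros are replaced by the shared NVME_AQ_BLK_MQ_DEPTH and NVME_AQ_MQ_TAG_DEPTH. Their definitions are not part of this diff, so the following is only a sketch of how they could relate. The value of NVME_AQ_DEPTH, the macro bodies, and the helper is_aen_completion() are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values: the real definitions live in a header not shown in this diff. */
#define NVME_AQ_DEPTH		32
#define NVME_NR_AEN_COMMANDS	1

/* Command IDs below this value belong to blk-mq; IDs at or above it are AENs. */
#define NVME_AQ_BLK_MQ_DEPTH	(NVME_AQ_DEPTH - NVME_NR_AEN_COMMANDS)

/* One further entry stays unused so that a full queue is distinguishable. */
#define NVME_AQ_MQ_TAG_DEPTH	(NVME_AQ_BLK_MQ_DEPTH - 1)

/* Hypothetical helper mirroring the check made in nvme_handle_cqe(). */
static bool is_aen_completion(uint16_t qid, uint16_t command_id)
{
	return qid == 0 && command_id >= NVME_AQ_BLK_MQ_DEPTH;
}
```

Under these assumptions the AEN command always carries the first command ID that blk-mq can never hand out, which is why the aer_idx argument can be dropped from submit_async_event.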


@ -41,17 +41,9 @@
#define NVME_RDMA_MAX_INLINE_SEGMENTS 1
/*
* We handle AEN commands ourselves and don't even let the
* block layer know about them.
*/
#define NVME_RDMA_NR_AEN_COMMANDS 1
#define NVME_RDMA_AQ_BLKMQ_DEPTH \
(NVME_AQ_DEPTH - NVME_RDMA_NR_AEN_COMMANDS)
struct nvme_rdma_device {
struct ib_device *dev;
struct ib_pd *pd;
struct ib_device *dev;
struct ib_pd *pd;
struct kref ref;
struct list_head entry;
};
@ -79,8 +71,8 @@ struct nvme_rdma_request {
};
enum nvme_rdma_queue_flags {
NVME_RDMA_Q_LIVE = 0,
NVME_RDMA_Q_DELETING = 1,
NVME_RDMA_Q_ALLOCATED = 0,
NVME_RDMA_Q_LIVE = 1,
};
struct nvme_rdma_queue {
@ -105,7 +97,6 @@ struct nvme_rdma_ctrl {
/* other member variables */
struct blk_mq_tag_set tag_set;
struct work_struct delete_work;
struct work_struct err_work;
struct nvme_rdma_qe async_event_sqe;
@ -274,6 +265,9 @@ static int nvme_rdma_reinit_request(void *data, struct request *rq)
struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
int ret = 0;
if (WARN_ON_ONCE(!req->mr))
return 0;
ib_dereg_mr(req->mr);
req->mr = ib_alloc_mr(dev->pd, IB_MR_TYPE_MEM_REG,
@ -434,11 +428,9 @@ out_err:
static void nvme_rdma_destroy_queue_ib(struct nvme_rdma_queue *queue)
{
struct nvme_rdma_device *dev;
struct ib_device *ibdev;
struct nvme_rdma_device *dev = queue->device;
struct ib_device *ibdev = dev->dev;
dev = queue->device;
ibdev = dev->dev;
rdma_destroy_qp(queue->cm_id);
ib_free_cq(queue->ib_cq);
@ -493,7 +485,7 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
return 0;
out_destroy_qp:
ib_destroy_qp(queue->qp);
rdma_destroy_qp(queue->cm_id);
out_destroy_ib_cq:
ib_free_cq(queue->ib_cq);
out_put_dev:
@ -544,11 +536,11 @@ static int nvme_rdma_alloc_queue(struct nvme_rdma_ctrl *ctrl,
ret = nvme_rdma_wait_for_cm(queue);
if (ret) {
dev_info(ctrl->ctrl.device,
"rdma_resolve_addr wait failed (%d).\n", ret);
"rdma connection establishment failed (%d)\n", ret);
goto out_destroy_cm_id;
}
clear_bit(NVME_RDMA_Q_DELETING, &queue->flags);
set_bit(NVME_RDMA_Q_ALLOCATED, &queue->flags);
return 0;
@ -568,7 +560,7 @@ static void nvme_rdma_stop_queue(struct nvme_rdma_queue *queue)
static void nvme_rdma_free_queue(struct nvme_rdma_queue *queue)
{
if (test_and_set_bit(NVME_RDMA_Q_DELETING, &queue->flags))
if (!test_and_clear_bit(NVME_RDMA_Q_ALLOCATED, &queue->flags))
return;
if (nvme_rdma_queue_idx(queue) == 0) {
@ -676,11 +668,10 @@ out_free_queues:
return ret;
}
static void nvme_rdma_free_tagset(struct nvme_ctrl *nctrl, bool admin)
static void nvme_rdma_free_tagset(struct nvme_ctrl *nctrl,
struct blk_mq_tag_set *set)
{
struct nvme_rdma_ctrl *ctrl = to_rdma_ctrl(nctrl);
struct blk_mq_tag_set *set = admin ?
&ctrl->admin_tag_set : &ctrl->tag_set;
blk_mq_free_tag_set(set);
nvme_rdma_dev_put(ctrl->device);
@ -697,7 +688,7 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
set = &ctrl->admin_tag_set;
memset(set, 0, sizeof(*set));
set->ops = &nvme_rdma_admin_mq_ops;
set->queue_depth = NVME_RDMA_AQ_BLKMQ_DEPTH;
set->queue_depth = NVME_AQ_MQ_TAG_DEPTH;
set->reserved_tags = 2; /* connect + keep-alive */
set->numa_node = NUMA_NO_NODE;
set->cmd_size = sizeof(struct nvme_rdma_request) +
@ -705,6 +696,7 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
set->driver_data = ctrl;
set->nr_hw_queues = 1;
set->timeout = ADMIN_TIMEOUT;
set->flags = BLK_MQ_F_NO_SCHED;
} else {
set = &ctrl->tag_set;
memset(set, 0, sizeof(*set));
@ -748,7 +740,7 @@ static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl,
nvme_rdma_stop_queue(&ctrl->queues[0]);
if (remove) {
blk_cleanup_queue(ctrl->ctrl.admin_q);
nvme_rdma_free_tagset(&ctrl->ctrl, true);
nvme_rdma_free_tagset(&ctrl->ctrl, ctrl->ctrl.admin_tagset);
}
nvme_rdma_free_queue(&ctrl->queues[0]);
}
@ -780,8 +772,7 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
goto out_free_tagset;
}
} else {
error = blk_mq_reinit_tagset(&ctrl->admin_tag_set,
nvme_rdma_reinit_request);
error = nvme_reinit_tagset(&ctrl->ctrl, ctrl->ctrl.admin_tagset);
if (error)
goto out_free_queue;
}
@ -825,7 +816,7 @@ out_cleanup_queue:
blk_cleanup_queue(ctrl->ctrl.admin_q);
out_free_tagset:
if (new)
nvme_rdma_free_tagset(&ctrl->ctrl, true);
nvme_rdma_free_tagset(&ctrl->ctrl, ctrl->ctrl.admin_tagset);
out_free_queue:
nvme_rdma_free_queue(&ctrl->queues[0]);
return error;
@ -837,7 +828,7 @@ static void nvme_rdma_destroy_io_queues(struct nvme_rdma_ctrl *ctrl,
nvme_rdma_stop_io_queues(ctrl);
if (remove) {
blk_cleanup_queue(ctrl->ctrl.connect_q);
nvme_rdma_free_tagset(&ctrl->ctrl, false);
nvme_rdma_free_tagset(&ctrl->ctrl, ctrl->ctrl.tagset);
}
nvme_rdma_free_io_queues(ctrl);
}
@ -863,8 +854,7 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
goto out_free_tag_set;
}
} else {
ret = blk_mq_reinit_tagset(&ctrl->tag_set,
nvme_rdma_reinit_request);
ret = nvme_reinit_tagset(&ctrl->ctrl, ctrl->ctrl.tagset);
if (ret)
goto out_free_io_queues;
@ -883,7 +873,7 @@ out_cleanup_connect_q:
blk_cleanup_queue(ctrl->ctrl.connect_q);
out_free_tag_set:
if (new)
nvme_rdma_free_tagset(&ctrl->ctrl, false);
nvme_rdma_free_tagset(&ctrl->ctrl, ctrl->ctrl.tagset);
out_free_io_queues:
nvme_rdma_free_io_queues(ctrl);
return ret;
@ -922,7 +912,7 @@ static void nvme_rdma_reconnect_or_remove(struct nvme_rdma_ctrl *ctrl)
ctrl->ctrl.opts->reconnect_delay * HZ);
} else {
dev_info(ctrl->ctrl.device, "Removing controller...\n");
queue_work(nvme_wq, &ctrl->delete_work);
nvme_delete_ctrl(&ctrl->ctrl);
}
}
@ -935,10 +925,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
++ctrl->ctrl.nr_reconnects;
if (ctrl->ctrl.queue_count > 1)
nvme_rdma_destroy_io_queues(ctrl, false);
nvme_rdma_destroy_admin_queue(ctrl, false);
ret = nvme_rdma_configure_admin_queue(ctrl, false);
if (ret)
goto requeue;
@ -946,7 +932,7 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
if (ctrl->ctrl.queue_count > 1) {
ret = nvme_rdma_configure_io_queues(ctrl, false);
if (ret)
goto requeue;
goto destroy_admin;
}
changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
@ -956,14 +942,17 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
return;
}
ctrl->ctrl.nr_reconnects = 0;
nvme_start_ctrl(&ctrl->ctrl);
dev_info(ctrl->ctrl.device, "Successfully reconnected\n");
dev_info(ctrl->ctrl.device, "Successfully reconnected (%d attempts)\n",
ctrl->ctrl.nr_reconnects);
ctrl->ctrl.nr_reconnects = 0;
return;
destroy_admin:
nvme_rdma_destroy_admin_queue(ctrl, false);
requeue:
dev_info(ctrl->ctrl.device, "Failed reconnect attempt %d\n",
ctrl->ctrl.nr_reconnects);
@ -979,17 +968,15 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
if (ctrl->ctrl.queue_count > 1) {
nvme_stop_queues(&ctrl->ctrl);
nvme_rdma_stop_io_queues(ctrl);
}
blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
nvme_rdma_stop_queue(&ctrl->queues[0]);
/* We must take care of fastfail/requeue all our inflight requests */
if (ctrl->ctrl.queue_count > 1)
blk_mq_tagset_busy_iter(&ctrl->tag_set,
nvme_cancel_request, &ctrl->ctrl);
nvme_rdma_destroy_io_queues(ctrl, false);
}
blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
nvme_cancel_request, &ctrl->ctrl);
nvme_rdma_destroy_admin_queue(ctrl, false);
/*
* queues are not a live anymore, so restart the queues to fail fast
@ -1065,7 +1052,7 @@ static void nvme_rdma_unmap_data(struct nvme_rdma_queue *queue,
if (!blk_rq_bytes(rq))
return;
if (req->mr->need_inval) {
if (req->mr->need_inval && test_bit(NVME_RDMA_Q_LIVE, &req->queue->flags)) {
res = nvme_rdma_inv_rkey(queue, req);
if (unlikely(res < 0)) {
dev_err(ctrl->ctrl.device,
@ -1314,7 +1301,7 @@ static struct blk_mq_tags *nvme_rdma_tagset(struct nvme_rdma_queue *queue)
return queue->ctrl->tag_set.tags[queue_idx - 1];
}
static void nvme_rdma_submit_async_event(struct nvme_ctrl *arg, int aer_idx)
static void nvme_rdma_submit_async_event(struct nvme_ctrl *arg)
{
struct nvme_rdma_ctrl *ctrl = to_rdma_ctrl(arg);
struct nvme_rdma_queue *queue = &ctrl->queues[0];
@ -1324,14 +1311,11 @@ static void nvme_rdma_submit_async_event(struct nvme_ctrl *arg, int aer_idx)
struct ib_sge sge;
int ret;
if (WARN_ON_ONCE(aer_idx != 0))
return;
ib_dma_sync_single_for_cpu(dev, sqe->dma, sizeof(*cmd), DMA_TO_DEVICE);
memset(cmd, 0, sizeof(*cmd));
cmd->common.opcode = nvme_admin_async_event;
cmd->common.command_id = NVME_RDMA_AQ_BLKMQ_DEPTH;
cmd->common.command_id = NVME_AQ_BLK_MQ_DEPTH;
cmd->common.flags |= NVME_CMD_SGL_METABUF;
nvme_rdma_set_sg_null(cmd);
@ -1393,7 +1377,7 @@ static int __nvme_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc, int tag)
* for them but rather special case them here.
*/
if (unlikely(nvme_rdma_queue_idx(queue) == 0 &&
cqe->command_id >= NVME_RDMA_AQ_BLKMQ_DEPTH))
cqe->command_id >= NVME_AQ_BLK_MQ_DEPTH))
nvme_complete_async_event(&queue->ctrl->ctrl, cqe->status,
&cqe->result);
else
@ -1590,6 +1574,10 @@ nvme_rdma_timeout(struct request *rq, bool reserved)
{
struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
dev_warn(req->queue->ctrl->ctrl.device,
"I/O %d QID %d timeout, reset controller\n",
rq->tag, nvme_rdma_queue_idx(req->queue));
/* queue error recovery */
nvme_rdma_error_recovery(req->queue->ctrl);
@ -1767,50 +1755,9 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
nvme_rdma_destroy_admin_queue(ctrl, shutdown);
}
static void nvme_rdma_remove_ctrl(struct nvme_rdma_ctrl *ctrl)
static void nvme_rdma_delete_ctrl(struct nvme_ctrl *ctrl)
{
nvme_remove_namespaces(&ctrl->ctrl);
nvme_rdma_shutdown_ctrl(ctrl, true);
nvme_uninit_ctrl(&ctrl->ctrl);
nvme_put_ctrl(&ctrl->ctrl);
}
static void nvme_rdma_del_ctrl_work(struct work_struct *work)
{
struct nvme_rdma_ctrl *ctrl = container_of(work,
struct nvme_rdma_ctrl, delete_work);
nvme_stop_ctrl(&ctrl->ctrl);
nvme_rdma_remove_ctrl(ctrl);
}
static int __nvme_rdma_del_ctrl(struct nvme_rdma_ctrl *ctrl)
{
if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_DELETING))
return -EBUSY;
if (!queue_work(nvme_wq, &ctrl->delete_work))
return -EBUSY;
return 0;
}
static int nvme_rdma_del_ctrl(struct nvme_ctrl *nctrl)
{
struct nvme_rdma_ctrl *ctrl = to_rdma_ctrl(nctrl);
int ret = 0;
/*
* Keep a reference until all work is flushed since
* __nvme_rdma_del_ctrl can free the ctrl mem
*/
if (!kref_get_unless_zero(&ctrl->ctrl.kref))
return -EBUSY;
ret = __nvme_rdma_del_ctrl(ctrl);
if (!ret)
flush_work(&ctrl->delete_work);
nvme_put_ctrl(&ctrl->ctrl);
return ret;
nvme_rdma_shutdown_ctrl(to_rdma_ctrl(ctrl), true);
}
static void nvme_rdma_reset_ctrl_work(struct work_struct *work)
@ -1834,7 +1781,11 @@ static void nvme_rdma_reset_ctrl_work(struct work_struct *work)
}
changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
WARN_ON_ONCE(!changed);
if (!changed) {
/* state change failure is ok if we're in DELETING state */
WARN_ON_ONCE(ctrl->ctrl.state != NVME_CTRL_DELETING);
return;
}
nvme_start_ctrl(&ctrl->ctrl);
@ -1842,7 +1793,10 @@ static void nvme_rdma_reset_ctrl_work(struct work_struct *work)
out_fail:
dev_warn(ctrl->ctrl.device, "Removing after reset failure\n");
nvme_rdma_remove_ctrl(ctrl);
nvme_remove_namespaces(&ctrl->ctrl);
nvme_rdma_shutdown_ctrl(ctrl, true);
nvme_uninit_ctrl(&ctrl->ctrl);
nvme_put_ctrl(&ctrl->ctrl);
}
static const struct nvme_ctrl_ops nvme_rdma_ctrl_ops = {
@ -1854,10 +1808,88 @@ static const struct nvme_ctrl_ops nvme_rdma_ctrl_ops = {
.reg_write32 = nvmf_reg_write32,
.free_ctrl = nvme_rdma_free_ctrl,
.submit_async_event = nvme_rdma_submit_async_event,
.delete_ctrl = nvme_rdma_del_ctrl,
.delete_ctrl = nvme_rdma_delete_ctrl,
.get_address = nvmf_get_address,
.reinit_request = nvme_rdma_reinit_request,
};
static inline bool
__nvme_rdma_options_match(struct nvme_rdma_ctrl *ctrl,
struct nvmf_ctrl_options *opts)
{
char *stdport = __stringify(NVME_RDMA_IP_PORT);
if (!nvmf_ctlr_matches_baseopts(&ctrl->ctrl, opts) ||
strcmp(opts->traddr, ctrl->ctrl.opts->traddr))
return false;
if (opts->mask & NVMF_OPT_TRSVCID &&
ctrl->ctrl.opts->mask & NVMF_OPT_TRSVCID) {
if (strcmp(opts->trsvcid, ctrl->ctrl.opts->trsvcid))
return false;
} else if (opts->mask & NVMF_OPT_TRSVCID) {
if (strcmp(opts->trsvcid, stdport))
return false;
} else if (ctrl->ctrl.opts->mask & NVMF_OPT_TRSVCID) {
if (strcmp(stdport, ctrl->ctrl.opts->trsvcid))
return false;
}
/* else, it's a match as both have stdport. Fall to next checks */
/*
* checking the local address is rough. In most cases, one
* is not specified and the host port is selected by the stack.
*
* Assume no match if:
* local address is specified and address is not the same
* local address is not specified but remote is, or vice versa
* (admin using specific host_traddr when it matters).
*/
if (opts->mask & NVMF_OPT_HOST_TRADDR &&
ctrl->ctrl.opts->mask & NVMF_OPT_HOST_TRADDR) {
if (strcmp(opts->host_traddr, ctrl->ctrl.opts->host_traddr))
return false;
} else if (opts->mask & NVMF_OPT_HOST_TRADDR ||
ctrl->ctrl.opts->mask & NVMF_OPT_HOST_TRADDR)
return false;
/*
* if neither controller had an host port specified, assume it's
* a match as everything else matched.
*/
return true;
}
/*
* Fails a connection request if it matches an existing controller
* (association) with the same tuple:
* <Host NQN, Host ID, local address, remote address, remote port, SUBSYS NQN>
*
* if local address is not specified in the request, it will match an
* existing controller with all the other parameters the same and no
* local port address specified as well.
*
* The ports don't need to be compared as they are intrinsically
* already matched by the port pointers supplied.
*/
static bool
nvme_rdma_existing_controller(struct nvmf_ctrl_options *opts)
{
struct nvme_rdma_ctrl *ctrl;
bool found = false;
mutex_lock(&nvme_rdma_ctrl_mutex);
list_for_each_entry(ctrl, &nvme_rdma_ctrl_list, list) {
found = __nvme_rdma_options_match(ctrl, opts);
if (found)
break;
}
mutex_unlock(&nvme_rdma_ctrl_mutex);
return found;
}
static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
struct nvmf_ctrl_options *opts)
{
@ -1894,6 +1926,11 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
}
}
if (!opts->duplicate_connect && nvme_rdma_existing_controller(opts)) {
ret = -EALREADY;
goto out_free_ctrl;
}
ret = nvme_init_ctrl(&ctrl->ctrl, dev, &nvme_rdma_ctrl_ops,
0 /* no quirks, we're perfect! */);
if (ret)
@ -1902,7 +1939,6 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
INIT_DELAYED_WORK(&ctrl->reconnect_work,
nvme_rdma_reconnect_ctrl_work);
INIT_WORK(&ctrl->err_work, nvme_rdma_error_recovery_work);
INIT_WORK(&ctrl->delete_work, nvme_rdma_del_ctrl_work);
INIT_WORK(&ctrl->ctrl.reset_work, nvme_rdma_reset_ctrl_work);
ctrl->ctrl.queue_count = opts->nr_io_queues + 1; /* +1 for admin queue */
@ -1961,7 +1997,7 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
dev_info(ctrl->ctrl.device, "new ctrl: NQN \"%s\", addr %pISpcs\n",
ctrl->ctrl.opts->subsysnqn, &ctrl->addr);
kref_get(&ctrl->ctrl.kref);
nvme_get_ctrl(&ctrl->ctrl);
mutex_lock(&nvme_rdma_ctrl_mutex);
list_add_tail(&ctrl->list, &nvme_rdma_ctrl_list);
@ -2006,7 +2042,7 @@ static void nvme_rdma_remove_one(struct ib_device *ib_device, void *client_data)
dev_info(ctrl->ctrl.device,
"Removing ctrl: NQN \"%s\", addr %pISp\n",
ctrl->ctrl.opts->subsysnqn, &ctrl->addr);
__nvme_rdma_del_ctrl(ctrl);
nvme_delete_ctrl(&ctrl->ctrl);
}
mutex_unlock(&nvme_rdma_ctrl_mutex);
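
Note on the duplicate-connect check added in this file: __nvme_rdma_options_match() treats a missing trsvcid as the standard port and requires the local address to be either equal on both sides or absent on both sides. A reduced, userspace-only sketch of that comparison follows. struct conn_opts, conn_opts_match() and the literal default port are hypothetical stand-ins, and the base-option comparison done by nvmf_ctlr_matches_baseopts() is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed default; stands in for __stringify(NVME_RDMA_IP_PORT). */
#define DEFAULT_TRSVCID "4420"

/* Hypothetical, reduced option set: NULL means "not specified". */
struct conn_opts {
	const char *traddr;      /* remote address, always set */
	const char *trsvcid;     /* remote port, optional */
	const char *host_traddr; /* local address, optional */
};

static bool conn_opts_match(const struct conn_opts *a, const struct conn_opts *b)
{
	/* An unspecified port is compared as if it were the default port. */
	const char *pa = a->trsvcid ? a->trsvcid : DEFAULT_TRSVCID;
	const char *pb = b->trsvcid ? b->trsvcid : DEFAULT_TRSVCID;

	if (strcmp(a->traddr, b->traddr) != 0)
		return false;
	if (strcmp(pa, pb) != 0)
		return false;

	/* Local address: both set and equal, or both unset. */
	if (a->host_traddr && b->host_traddr)
		return strcmp(a->host_traddr, b->host_traddr) == 0;
	return !a->host_traddr && !b->host_traddr;
}
```

Substituting the default port up front gives the same result as the three-way if/else chain in the patch, at the cost of hiding which side omitted the option.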


@ -35,17 +35,14 @@ u32 nvmet_get_log_page_len(struct nvme_command *cmd)
static u16 nvmet_get_smart_log_nsid(struct nvmet_req *req,
struct nvme_smart_log *slog)
{
u16 status;
struct nvmet_ns *ns;
u64 host_reads, host_writes, data_units_read, data_units_written;
status = NVME_SC_SUCCESS;
ns = nvmet_find_namespace(req->sq->ctrl, req->cmd->get_log_page.nsid);
if (!ns) {
status = NVME_SC_INVALID_NS;
pr_err("nvmet : Could not find namespace id : %d\n",
le32_to_cpu(req->cmd->get_log_page.nsid));
goto out;
return NVME_SC_INVALID_NS;
}
host_reads = part_stat_read(ns->bdev->bd_part, ios[READ]);
@ -58,20 +55,18 @@ static u16 nvmet_get_smart_log_nsid(struct nvmet_req *req,
put_unaligned_le64(host_writes, &slog->host_writes[0]);
put_unaligned_le64(data_units_written, &slog->data_units_written[0]);
nvmet_put_namespace(ns);
out:
return status;
return NVME_SC_SUCCESS;
}
static u16 nvmet_get_smart_log_all(struct nvmet_req *req,
struct nvme_smart_log *slog)
{
u16 status;
u64 host_reads = 0, host_writes = 0;
u64 data_units_read = 0, data_units_written = 0;
struct nvmet_ns *ns;
struct nvmet_ctrl *ctrl;
status = NVME_SC_SUCCESS;
ctrl = req->sq->ctrl;
rcu_read_lock();
@ -91,7 +86,7 @@ static u16 nvmet_get_smart_log_all(struct nvmet_req *req,
put_unaligned_le64(host_writes, &slog->host_writes[0]);
put_unaligned_le64(data_units_written, &slog->data_units_written[0]);
return status;
return NVME_SC_SUCCESS;
}
static u16 nvmet_get_smart_log(struct nvmet_req *req,
@ -144,10 +139,8 @@ static void nvmet_execute_get_log_page(struct nvmet_req *req)
}
smart_log = buf;
status = nvmet_get_smart_log(req, smart_log);
if (status) {
memset(buf, '\0', data_len);
if (status)
goto err;
}
break;
case NVME_LOG_FW_SLOT:
/*
@ -300,7 +293,7 @@ static void nvmet_execute_identify_ns(struct nvmet_req *req)
}
/*
* nuse = ncap = nsze isn't aways true, but we have no way to find
* nuse = ncap = nsze isn't always true, but we have no way to find
* that out from the underlying device.
*/
id->ncap = id->nuse = id->nsze =
@ -424,7 +417,7 @@ out:
}
/*
* A "mimimum viable" abort implementation: the command is mandatory in the
* A "minimum viable" abort implementation: the command is mandatory in the
* spec, but we are not required to do any useful work. We couldn't really
* do a useful abort, so don't bother even with waiting for the command
* to be exectuted and return immediately telling the command to abort


@ -57,6 +57,17 @@ u16 nvmet_copy_from_sgl(struct nvmet_req *req, off_t off, void *buf, size_t len)
return 0;
}
static unsigned int nvmet_max_nsid(struct nvmet_subsys *subsys)
{
struct nvmet_ns *ns;
if (list_empty(&subsys->namespaces))
return 0;
ns = list_last_entry(&subsys->namespaces, struct nvmet_ns, dev_link);
return ns->nsid;
}
static u32 nvmet_async_event_result(struct nvmet_async_event *aen)
{
return aen->event_type | (aen->event_info << 8) | (aen->log_page << 16);
@ -334,6 +345,8 @@ void nvmet_ns_disable(struct nvmet_ns *ns)
ns->enabled = false;
list_del_rcu(&ns->dev_link);
if (ns->nsid == subsys->max_nsid)
subsys->max_nsid = nvmet_max_nsid(subsys);
mutex_unlock(&subsys->lock);
/*
@ -497,6 +510,7 @@ bool nvmet_req_init(struct nvmet_req *req, struct nvmet_cq *cq,
req->ops = ops;
req->sg = NULL;
req->sg_cnt = 0;
req->transfer_len = 0;
req->rsp->status = 0;
/* no support for fused commands yet */
@ -546,6 +560,15 @@ void nvmet_req_uninit(struct nvmet_req *req)
}
EXPORT_SYMBOL_GPL(nvmet_req_uninit);
void nvmet_req_execute(struct nvmet_req *req)
{
if (unlikely(req->data_len != req->transfer_len))
nvmet_req_complete(req, NVME_SC_SGL_INVALID_DATA | NVME_SC_DNR);
else
req->execute(req);
}
EXPORT_SYMBOL_GPL(nvmet_req_execute);
static inline bool nvmet_cc_en(u32 cc)
{
return (cc >> NVME_CC_EN_SHIFT) & 0x1;
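
Note on nvmet_req_execute() added above: transports now call this wrapper instead of invoking req->execute directly, so a request whose command-derived length (data_len) disagrees with the length the transport actually set up (transfer_len) is completed with an error rather than executed. A self-contained sketch of the same gate follows. All sketch_* names and the ST_* status values are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status values for this sketch only. */
enum { ST_PENDING = -1, ST_OK = 0, ST_SGL_INVALID_DATA = 1 };

struct sketch_req {
	size_t data_len;     /* length parsed from the command */
	size_t transfer_len; /* length the transport actually mapped */
	int status;
	void (*execute)(struct sketch_req *req);
};

static void sketch_complete(struct sketch_req *req, int status)
{
	req->status = status;
}

/* Stand-in for a command handler that always succeeds. */
static void sketch_handler(struct sketch_req *req)
{
	sketch_complete(req, ST_OK);
}

/* Run the handler only if both lengths agree; otherwise fail the request. */
static void sketch_req_execute(struct sketch_req *req)
{
	if (req->data_len != req->transfer_len)
		sketch_complete(req, ST_SGL_INVALID_DATA);
	else
		req->execute(req);
}
```

Centralizing the comparison in one wrapper means each transport only has to fill in transfer_len, as the FC and loop hunks in this diff do.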

View File

@ -76,7 +76,6 @@ struct nvmet_fc_fcp_iod {
dma_addr_t rspdma;
struct scatterlist *data_sg;
int data_sg_cnt;
u32 total_length;
u32 offset;
enum nvmet_fcp_datadir io_dir;
bool active;
@ -150,6 +149,7 @@ struct nvmet_fc_tgt_assoc {
struct list_head a_list;
struct nvmet_fc_tgt_queue *queues[NVMET_NR_QUEUES + 1];
struct kref ref;
struct work_struct del_work;
};
@ -232,6 +232,7 @@ static void nvmet_fc_tgtport_put(struct nvmet_fc_tgtport *tgtport);
static int nvmet_fc_tgtport_get(struct nvmet_fc_tgtport *tgtport);
static void nvmet_fc_handle_fcp_rqst(struct nvmet_fc_tgtport *tgtport,
struct nvmet_fc_fcp_iod *fod);
static void nvmet_fc_delete_target_assoc(struct nvmet_fc_tgt_assoc *assoc);
/* *********************** FC-NVME DMA Handling **************************** */
@ -802,6 +803,16 @@ nvmet_fc_find_target_queue(struct nvmet_fc_tgtport *tgtport,
return NULL;
}
static void
nvmet_fc_delete_assoc(struct work_struct *work)
{
struct nvmet_fc_tgt_assoc *assoc =
container_of(work, struct nvmet_fc_tgt_assoc, del_work);
nvmet_fc_delete_target_assoc(assoc);
nvmet_fc_tgt_a_put(assoc);
}
static struct nvmet_fc_tgt_assoc *
nvmet_fc_alloc_target_assoc(struct nvmet_fc_tgtport *tgtport)
{
@ -826,6 +837,7 @@ nvmet_fc_alloc_target_assoc(struct nvmet_fc_tgtport *tgtport)
assoc->a_id = idx;
INIT_LIST_HEAD(&assoc->a_list);
kref_init(&assoc->ref);
INIT_WORK(&assoc->del_work, nvmet_fc_delete_assoc);
while (needrandom) {
get_random_bytes(&ran, sizeof(ran) - BYTES_FOR_QID);
@ -1118,8 +1130,7 @@ nvmet_fc_delete_ctrl(struct nvmet_ctrl *ctrl)
nvmet_fc_tgtport_put(tgtport);
if (found_ctrl) {
nvmet_fc_delete_target_assoc(assoc);
nvmet_fc_tgt_a_put(assoc);
schedule_work(&assoc->del_work);
return;
}
@ -1688,7 +1699,7 @@ nvmet_fc_alloc_tgt_pgs(struct nvmet_fc_fcp_iod *fod)
u32 page_len, length;
int i = 0;
length = fod->total_length;
length = fod->req.transfer_len;
nent = DIV_ROUND_UP(length, PAGE_SIZE);
sg = kmalloc_array(nent, sizeof(struct scatterlist), GFP_KERNEL);
if (!sg)
@ -1777,7 +1788,7 @@ nvmet_fc_prep_fcp_rsp(struct nvmet_fc_tgtport *tgtport,
u32 rsn, rspcnt, xfr_length;
if (fod->fcpreq->op == NVMET_FCOP_READDATA_RSP)
xfr_length = fod->total_length;
xfr_length = fod->req.transfer_len;
else
xfr_length = fod->offset;
@ -1803,7 +1814,7 @@ nvmet_fc_prep_fcp_rsp(struct nvmet_fc_tgtport *tgtport,
rspcnt = atomic_inc_return(&fod->queue->zrspcnt);
if (!(rspcnt % fod->queue->ersp_ratio) ||
sqe->opcode == nvme_fabrics_command ||
xfr_length != fod->total_length ||
xfr_length != fod->req.transfer_len ||
(le16_to_cpu(cqe->status) & 0xFFFE) || cqewd[0] || cqewd[1] ||
(sqe->flags & (NVME_CMD_FUSE_FIRST | NVME_CMD_FUSE_SECOND)) ||
queue_90percent_full(fod->queue, le16_to_cpu(cqe->sq_head)))
@ -1880,7 +1891,7 @@ nvmet_fc_transfer_fcp_data(struct nvmet_fc_tgtport *tgtport,
fcpreq->timeout = NVME_FC_TGTOP_TIMEOUT_SEC;
tlen = min_t(u32, tgtport->max_sg_cnt * PAGE_SIZE,
(fod->total_length - fod->offset));
(fod->req.transfer_len - fod->offset));
fcpreq->transfer_length = tlen;
fcpreq->transferred_length = 0;
fcpreq->fcp_error = 0;
@ -1894,7 +1905,7 @@ nvmet_fc_transfer_fcp_data(struct nvmet_fc_tgtport *tgtport,
* combined xfr with response.
*/
if ((op == NVMET_FCOP_READDATA) &&
((fod->offset + fcpreq->transfer_length) == fod->total_length) &&
((fod->offset + fcpreq->transfer_length) == fod->req.transfer_len) &&
(tgtport->ops->target_features & NVMET_FCTGTFEAT_READDATA_RSP)) {
fcpreq->op = NVMET_FCOP_READDATA_RSP;
nvmet_fc_prep_fcp_rsp(tgtport, fod);
@ -1974,7 +1985,7 @@ nvmet_fc_fod_op_done(struct nvmet_fc_fcp_iod *fod)
}
fod->offset += fcpreq->transferred_length;
if (fod->offset != fod->total_length) {
if (fod->offset != fod->req.transfer_len) {
spin_lock_irqsave(&fod->flock, flags);
fod->writedataactive = true;
spin_unlock_irqrestore(&fod->flock, flags);
@ -1986,9 +1997,7 @@ nvmet_fc_fod_op_done(struct nvmet_fc_fcp_iod *fod)
}
/* data transfer complete, resume with nvmet layer */
fod->req.execute(&fod->req);
nvmet_req_execute(&fod->req);
break;
case NVMET_FCOP_READDATA:
@ -2011,7 +2020,7 @@ nvmet_fc_fod_op_done(struct nvmet_fc_fcp_iod *fod)
}
fod->offset += fcpreq->transferred_length;
if (fod->offset != fod->total_length) {
if (fod->offset != fod->req.transfer_len) {
/* transfer the next chunk */
nvmet_fc_transfer_fcp_data(tgtport, fod,
NVMET_FCOP_READDATA);
@ -2148,7 +2157,7 @@ nvmet_fc_handle_fcp_rqst(struct nvmet_fc_tgtport *tgtport,
fod->fcpreq->done = nvmet_fc_xmt_fcp_op_done;
fod->total_length = be32_to_cpu(cmdiu->data_len);
fod->req.transfer_len = be32_to_cpu(cmdiu->data_len);
if (cmdiu->flags & FCNVME_CMD_FLAGS_WRITE) {
fod->io_dir = NVMET_FCP_WRITE;
if (!nvme_is_write(&cmdiu->sqe))
@ -2159,7 +2168,7 @@ nvmet_fc_handle_fcp_rqst(struct nvmet_fc_tgtport *tgtport,
goto transport_error;
} else {
fod->io_dir = NVMET_FCP_NODATA;
if (fod->total_length)
if (fod->req.transfer_len)
goto transport_error;
}
@ -2167,9 +2176,6 @@ nvmet_fc_handle_fcp_rqst(struct nvmet_fc_tgtport *tgtport,
fod->req.rsp = &fod->rspiubuf.cqe;
fod->req.port = fod->queue->port;
/* ensure nvmet handlers will set cmd handler callback */
fod->req.execute = NULL;
/* clear any response payload */
memset(&fod->rspiubuf, 0, sizeof(fod->rspiubuf));
@ -2189,7 +2195,7 @@ nvmet_fc_handle_fcp_rqst(struct nvmet_fc_tgtport *tgtport,
/* keep a running counter of tail position */
atomic_inc(&fod->queue->sqtail);
if (fod->total_length) {
if (fod->req.transfer_len) {
ret = nvmet_fc_alloc_tgt_pgs(fod);
if (ret) {
nvmet_req_complete(&fod->req, ret);
@ -2212,9 +2218,7 @@ nvmet_fc_handle_fcp_rqst(struct nvmet_fc_tgtport *tgtport,
* can invoke the nvmet_layer now. If read data, cmd completion will
* push the data
*/
fod->req.execute(&fod->req);
nvmet_req_execute(&fod->req);
return;
transport_error:

View File

@ -33,18 +33,11 @@ static inline u32 nvmet_rw_len(struct nvmet_req *req)
req->ns->blksize_shift;
}
static void nvmet_inline_bio_init(struct nvmet_req *req)
{
struct bio *bio = &req->inline_bio;
bio_init(bio, req->inline_bvec, NVMET_MAX_INLINE_BIOVEC);
}
static void nvmet_execute_rw(struct nvmet_req *req)
{
int sg_cnt = req->sg_cnt;
struct bio *bio = &req->inline_bio;
struct scatterlist *sg;
struct bio *bio;
sector_t sector;
blk_qc_t cookie;
int op, op_flags = 0, i;
@ -66,8 +59,7 @@ static void nvmet_execute_rw(struct nvmet_req *req)
sector = le64_to_cpu(req->cmd->rw.slba);
sector <<= (req->ns->blksize_shift - 9);
nvmet_inline_bio_init(req);
bio = &req->inline_bio;
bio_init(bio, req->inline_bvec, ARRAY_SIZE(req->inline_bvec));
bio_set_dev(bio, req->ns->bdev);
bio->bi_iter.bi_sector = sector;
bio->bi_private = req;
@ -94,16 +86,14 @@ static void nvmet_execute_rw(struct nvmet_req *req)
cookie = submit_bio(bio);
blk_mq_poll(bdev_get_queue(req->ns->bdev), cookie);
blk_poll(bdev_get_queue(req->ns->bdev), cookie);
}
static void nvmet_execute_flush(struct nvmet_req *req)
{
struct bio *bio;
nvmet_inline_bio_init(req);
bio = &req->inline_bio;
struct bio *bio = &req->inline_bio;
bio_init(bio, req->inline_bvec, ARRAY_SIZE(req->inline_bvec));
bio_set_dev(bio, req->ns->bdev);
bio->bi_private = req;
bio->bi_end_io = nvmet_bio_done;

View File

@ -23,14 +23,6 @@
#define NVME_LOOP_MAX_SEGMENTS 256
/*
* We handle AEN commands ourselves and don't even let the
* block layer know about them.
*/
#define NVME_LOOP_NR_AEN_COMMANDS 1
#define NVME_LOOP_AQ_BLKMQ_DEPTH \
(NVME_AQ_DEPTH - NVME_LOOP_NR_AEN_COMMANDS)
struct nvme_loop_iod {
struct nvme_request nvme_req;
struct nvme_command cmd;
@ -53,7 +45,6 @@ struct nvme_loop_ctrl {
struct nvme_ctrl ctrl;
struct nvmet_ctrl *target_ctrl;
struct work_struct delete_work;
};
static inline struct nvme_loop_ctrl *to_loop_ctrl(struct nvme_ctrl *ctrl)
@ -113,7 +104,7 @@ static void nvme_loop_queue_response(struct nvmet_req *req)
* for them but rather special case them here.
*/
if (unlikely(nvme_loop_queue_idx(queue) == 0 &&
cqe->command_id >= NVME_LOOP_AQ_BLKMQ_DEPTH)) {
cqe->command_id >= NVME_AQ_BLK_MQ_DEPTH)) {
nvme_complete_async_event(&queue->ctrl->ctrl, cqe->status,
&cqe->result);
} else {
@ -136,7 +127,7 @@ static void nvme_loop_execute_work(struct work_struct *work)
struct nvme_loop_iod *iod =
container_of(work, struct nvme_loop_iod, work);
iod->req.execute(&iod->req);
nvmet_req_execute(&iod->req);
}
static enum blk_eh_timer_return
@ -185,6 +176,7 @@ static blk_status_t nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
iod->req.sg = iod->sg_table.sgl;
iod->req.sg_cnt = blk_rq_map_sg(req->q, req, iod->sg_table.sgl);
iod->req.transfer_len = blk_rq_bytes(req);
}
blk_mq_start_request(req);
@ -193,7 +185,7 @@ static blk_status_t nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
return BLK_STS_OK;
}
static void nvme_loop_submit_async_event(struct nvme_ctrl *arg, int aer_idx)
static void nvme_loop_submit_async_event(struct nvme_ctrl *arg)
{
struct nvme_loop_ctrl *ctrl = to_loop_ctrl(arg);
struct nvme_loop_queue *queue = &ctrl->queues[0];
@ -201,7 +193,7 @@ static void nvme_loop_submit_async_event(struct nvme_ctrl *arg, int aer_idx)
memset(&iod->cmd, 0, sizeof(iod->cmd));
iod->cmd.common.opcode = nvme_admin_async_event;
iod->cmd.common.command_id = NVME_LOOP_AQ_BLKMQ_DEPTH;
iod->cmd.common.command_id = NVME_AQ_BLK_MQ_DEPTH;
iod->cmd.common.flags |= NVME_CMD_SGL_METABUF;
if (!nvmet_req_init(&iod->req, &queue->nvme_cq, &queue->nvme_sq,
@ -357,7 +349,7 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl)
memset(&ctrl->admin_tag_set, 0, sizeof(ctrl->admin_tag_set));
ctrl->admin_tag_set.ops = &nvme_loop_admin_mq_ops;
ctrl->admin_tag_set.queue_depth = NVME_LOOP_AQ_BLKMQ_DEPTH;
ctrl->admin_tag_set.queue_depth = NVME_AQ_MQ_TAG_DEPTH;
ctrl->admin_tag_set.reserved_tags = 2; /* connect + keep-alive */
ctrl->admin_tag_set.numa_node = NUMA_NO_NODE;
ctrl->admin_tag_set.cmd_size = sizeof(struct nvme_loop_iod) +
@ -365,6 +357,7 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl)
ctrl->admin_tag_set.driver_data = ctrl;
ctrl->admin_tag_set.nr_hw_queues = 1;
ctrl->admin_tag_set.timeout = ADMIN_TIMEOUT;
ctrl->admin_tag_set.flags = BLK_MQ_F_NO_SCHED;
ctrl->queues[0].ctrl = ctrl;
error = nvmet_sq_init(&ctrl->queues[0].nvme_sq);
@ -438,41 +431,9 @@ static void nvme_loop_shutdown_ctrl(struct nvme_loop_ctrl *ctrl)
nvme_loop_destroy_admin_queue(ctrl);
}
static void nvme_loop_del_ctrl_work(struct work_struct *work)
static void nvme_loop_delete_ctrl_host(struct nvme_ctrl *ctrl)
{
struct nvme_loop_ctrl *ctrl = container_of(work,
struct nvme_loop_ctrl, delete_work);
nvme_stop_ctrl(&ctrl->ctrl);
nvme_remove_namespaces(&ctrl->ctrl);
nvme_loop_shutdown_ctrl(ctrl);
nvme_uninit_ctrl(&ctrl->ctrl);
nvme_put_ctrl(&ctrl->ctrl);
}
static int __nvme_loop_del_ctrl(struct nvme_loop_ctrl *ctrl)
{
if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_DELETING))
return -EBUSY;
if (!queue_work(nvme_wq, &ctrl->delete_work))
return -EBUSY;
return 0;
}
static int nvme_loop_del_ctrl(struct nvme_ctrl *nctrl)
{
struct nvme_loop_ctrl *ctrl = to_loop_ctrl(nctrl);
int ret;
ret = __nvme_loop_del_ctrl(ctrl);
if (ret)
return ret;
flush_work(&ctrl->delete_work);
return 0;
nvme_loop_shutdown_ctrl(to_loop_ctrl(ctrl));
}
static void nvme_loop_delete_ctrl(struct nvmet_ctrl *nctrl)
@ -482,7 +443,7 @@ static void nvme_loop_delete_ctrl(struct nvmet_ctrl *nctrl)
mutex_lock(&nvme_loop_ctrl_mutex);
list_for_each_entry(ctrl, &nvme_loop_ctrl_list, list) {
if (ctrl->ctrl.cntlid == nctrl->cntlid)
__nvme_loop_del_ctrl(ctrl);
nvme_delete_ctrl(&ctrl->ctrl);
}
mutex_unlock(&nvme_loop_ctrl_mutex);
}
@ -538,7 +499,7 @@ static const struct nvme_ctrl_ops nvme_loop_ctrl_ops = {
.reg_write32 = nvmf_reg_write32,
.free_ctrl = nvme_loop_free_ctrl,
.submit_async_event = nvme_loop_submit_async_event,
.delete_ctrl = nvme_loop_del_ctrl,
.delete_ctrl = nvme_loop_delete_ctrl_host,
};
static int nvme_loop_create_io_queues(struct nvme_loop_ctrl *ctrl)
@ -600,7 +561,6 @@ static struct nvme_ctrl *nvme_loop_create_ctrl(struct device *dev,
ctrl->ctrl.opts = opts;
INIT_LIST_HEAD(&ctrl->list);
INIT_WORK(&ctrl->delete_work, nvme_loop_del_ctrl_work);
INIT_WORK(&ctrl->ctrl.reset_work, nvme_loop_reset_ctrl_work);
ret = nvme_init_ctrl(&ctrl->ctrl, dev, &nvme_loop_ctrl_ops,
@ -641,7 +601,7 @@ static struct nvme_ctrl *nvme_loop_create_ctrl(struct device *dev,
dev_info(ctrl->ctrl.device,
"new ctrl: \"%s\"\n", ctrl->ctrl.opts->subsysnqn);
kref_get(&ctrl->ctrl.kref);
nvme_get_ctrl(&ctrl->ctrl);
changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
WARN_ON_ONCE(!changed);
@ -730,7 +690,7 @@ static void __exit nvme_loop_cleanup_module(void)
mutex_lock(&nvme_loop_ctrl_mutex);
list_for_each_entry_safe(ctrl, next, &nvme_loop_ctrl_list, list)
__nvme_loop_del_ctrl(ctrl);
nvme_delete_ctrl(&ctrl->ctrl);
mutex_unlock(&nvme_loop_ctrl_mutex);
flush_workqueue(nvme_wq);


@ -223,7 +223,10 @@ struct nvmet_req {
struct bio inline_bio;
struct bio_vec inline_bvec[NVMET_MAX_INLINE_BIOVEC];
int sg_cnt;
/* data length as parsed from the command: */
size_t data_len;
/* data length as parsed from the SGL descriptor: */
size_t transfer_len;
struct nvmet_port *port;
@ -266,6 +269,7 @@ u16 nvmet_parse_fabrics_cmd(struct nvmet_req *req);
bool nvmet_req_init(struct nvmet_req *req, struct nvmet_cq *cq,
struct nvmet_sq *sq, struct nvmet_fabrics_ops *ops);
void nvmet_req_uninit(struct nvmet_req *req);
void nvmet_req_execute(struct nvmet_req *req);
void nvmet_req_complete(struct nvmet_req *req, u16 status);
void nvmet_cq_setup(struct nvmet_ctrl *ctrl, struct nvmet_cq *cq, u16 qid,
@ -314,7 +318,7 @@ u16 nvmet_copy_from_sgl(struct nvmet_req *req, off_t off, void *buf,
u32 nvmet_get_log_page_len(struct nvme_command *cmd);
#define NVMET_QUEUE_SIZE 1024
#define NVMET_NR_QUEUES 64
#define NVMET_NR_QUEUES 128
#define NVMET_MAX_CMD NVMET_QUEUE_SIZE
#define NVMET_KAS 10
#define NVMET_DISC_KATO 120
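
The new `transfer_len` field and the `nvmet_req_execute()` wrapper suggest that a request should only run when the length parsed from the command agrees with the length accumulated from the SGL descriptors. The body of `nvmet_req_execute()` is not shown in this diff, so the following is only a minimal userspace sketch of that idea. `struct req_model`, `req_add_sgl_len()`, `req_execute_checked()`, and the status values are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes for this sketch; not the NVMe-defined values. */
enum { STATUS_OK = 0, STATUS_LEN_MISMATCH = 1 };

/* Hypothetical stand-in for the two length fields in struct nvmet_req. */
struct req_model {
	size_t data_len;     /* length as parsed from the command */
	size_t transfer_len; /* length summed from the SGL descriptors */
	int executed;        /* set to 1 once the handler has run */
};

/* Transport code adds each mapped SGL segment to the running total. */
static void req_add_sgl_len(struct req_model *req, size_t len)
{
	assert(req != NULL);
	req->transfer_len += len;
}

/* Run the handler only if both lengths agree; otherwise report an error. */
static int req_execute_checked(struct req_model *req)
{
	assert(req != NULL);
	if (req->data_len != req->transfer_len)
		return STATUS_LEN_MISMATCH;
	req->executed = 1;
	return STATUS_OK;
}
```

Routing every transport through one checked entry point, instead of calling `req->execute()` directly, keeps the comparison in a single place.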


@ -148,14 +148,14 @@ static inline u32 get_unaligned_le24(const u8 *p)
static inline bool nvmet_rdma_need_data_in(struct nvmet_rdma_rsp *rsp)
{
return nvme_is_write(rsp->req.cmd) &&
rsp->req.data_len &&
rsp->req.transfer_len &&
!(rsp->flags & NVMET_RDMA_REQ_INLINE_DATA);
}
static inline bool nvmet_rdma_need_data_out(struct nvmet_rdma_rsp *rsp)
{
return !nvme_is_write(rsp->req.cmd) &&
rsp->req.data_len &&
rsp->req.transfer_len &&
!rsp->req.rsp->status &&
!(rsp->flags & NVMET_RDMA_REQ_INLINE_DATA);
}
@ -577,7 +577,7 @@ static void nvmet_rdma_read_data_done(struct ib_cq *cq, struct ib_wc *wc)
return;
}
rsp->req.execute(&rsp->req);
nvmet_req_execute(&rsp->req);
}
static void nvmet_rdma_use_inline_sg(struct nvmet_rdma_rsp *rsp, u32 len,
@ -609,6 +609,7 @@ static u16 nvmet_rdma_map_sgl_inline(struct nvmet_rdma_rsp *rsp)
nvmet_rdma_use_inline_sg(rsp, len, off);
rsp->flags |= NVMET_RDMA_REQ_INLINE_DATA;
rsp->req.transfer_len += len;
return 0;
}
@ -636,6 +637,7 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
nvmet_data_dir(&rsp->req));
if (ret < 0)
return NVME_SC_INTERNAL;
rsp->req.transfer_len += len;
rsp->n_rdma += ret;
if (invalidate) {
@ -693,7 +695,7 @@ static bool nvmet_rdma_execute_command(struct nvmet_rdma_rsp *rsp)
queue->cm_id->port_num, &rsp->read_cqe, NULL))
nvmet_req_complete(&rsp->req, NVME_SC_DATA_XFER_ERROR);
} else {
rsp->req.execute(&rsp->req);
nvmet_req_execute(&rsp->req);
}
return true;
@ -1512,15 +1514,17 @@ static struct nvmet_fabrics_ops nvmet_rdma_ops = {
static void nvmet_rdma_remove_one(struct ib_device *ib_device, void *client_data)
{
struct nvmet_rdma_queue *queue;
struct nvmet_rdma_queue *queue, *tmp;
/* Device is being removed, delete all queues using this device */
mutex_lock(&nvmet_rdma_queue_mutex);
list_for_each_entry(queue, &nvmet_rdma_queue_list, queue_list) {
list_for_each_entry_safe(queue, tmp, &nvmet_rdma_queue_list,
queue_list) {
if (queue->dev->device != ib_device)
continue;
pr_info("Removing queue %d\n", queue->idx);
list_del_init(&queue->queue_list);
__nvmet_rdma_queue_disconnect(queue);
}
mutex_unlock(&nvmet_rdma_queue_mutex);
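
The switch to `list_for_each_entry_safe()` in `nvmet_rdma_remove_one()` matters because the loop body now unlinks the current entry with `list_del_init()`. A plain iterator would then read the successor from a node that is no longer on the list. The sketch below illustrates the same precaution in userspace C with a singly linked list. It is not kernel code, and `struct queue_model` and `remove_queues_for_dev()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queue record; dev_id stands in for queue->dev->device. */
struct queue_model {
	int idx;
	int dev_id;
	struct queue_model *next;
};

/*
 * Unlink every node that belongs to dev_id and return how many were removed.
 * The successor is saved before the node is unlinked, which is the same
 * precaution list_for_each_entry_safe() takes with its extra cursor.
 */
static int remove_queues_for_dev(struct queue_model **head, int dev_id)
{
	struct queue_model **link = head;
	struct queue_model *cur, *tmp;
	int removed = 0;

	for (cur = *head; cur != NULL; cur = tmp) {
		tmp = cur->next;      /* remember the successor first */
		if (cur->dev_id != dev_id) {
			link = &cur->next;
			continue;
		}
		*link = tmp;          /* unlink cur from the list */
		cur->next = NULL;     /* leave the removed node detached */
		removed++;
	}
	return removed;
}
```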


@ -130,7 +130,8 @@ config CHR_DEV_OSST
config BLK_DEV_SR
tristate "SCSI CDROM support"
depends on SCSI
depends on SCSI && BLK_DEV
select CDROM
---help---
If you want to use a CD or DVD drive attached to your computer
by SCSI, FireWire, USB or ATAPI, say Y and read the SCSI-HOWTO


@ -3246,6 +3246,11 @@ lpfc_update_rport_devloss_tmo(struct lpfc_vport *vport)
continue;
if (ndlp->rport)
ndlp->rport->dev_loss_tmo = vport->cfg_devloss_tmo;
#if (IS_ENABLED(CONFIG_NVME_FC))
if (ndlp->nrport)
nvme_fc_set_remoteport_devloss(ndlp->nrport->remoteport,
vport->cfg_devloss_tmo);
#endif
}
spin_unlock_irq(shost->host_lock);
}


@ -252,9 +252,9 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
struct scsi_request *rq;
int ret = DRIVER_ERROR << 24;
req = blk_get_request(sdev->request_queue,
req = blk_get_request_flags(sdev->request_queue,
data_direction == DMA_TO_DEVICE ?
REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM);
REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, BLK_MQ_REQ_PREEMPT);
if (IS_ERR(req))
return ret;
rq = scsi_req(req);
@ -268,7 +268,7 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
rq->retries = retries;
req->timeout = timeout;
req->cmd_flags |= flags;
req->rq_flags |= rq_flags | RQF_QUIET | RQF_PREEMPT;
req->rq_flags |= rq_flags | RQF_QUIET;
/*
* head injection *required* here otherwise quiesce won't work
@ -1301,7 +1301,7 @@ scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
/*
* If the device is blocked we defer normal commands.
*/
if (!(req->rq_flags & RQF_PREEMPT))
if (req && !(req->rq_flags & RQF_PREEMPT))
ret = BLKPREP_DEFER;
break;
default:
@ -1310,7 +1310,7 @@ scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
* special commands. In particular any user initiated
* command is not allowed.
*/
if (!(req->rq_flags & RQF_PREEMPT))
if (req && !(req->rq_flags & RQF_PREEMPT))
ret = BLKPREP_KILL;
break;
}
@ -1940,6 +1940,33 @@ static void scsi_mq_done(struct scsi_cmnd *cmd)
blk_mq_complete_request(cmd->request);
}
static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
{
struct request_queue *q = hctx->queue;
struct scsi_device *sdev = q->queuedata;
atomic_dec(&sdev->device_busy);
put_device(&sdev->sdev_gendev);
}
static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
{
struct request_queue *q = hctx->queue;
struct scsi_device *sdev = q->queuedata;
if (!get_device(&sdev->sdev_gendev))
goto out;
if (!scsi_dev_queue_ready(q, sdev))
goto out_put_device;
return true;
out_put_device:
put_device(&sdev->sdev_gendev);
out:
return false;
}
static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *bd)
{
@ -1953,16 +1980,11 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
ret = prep_to_mq(scsi_prep_state_check(sdev, req));
if (ret != BLK_STS_OK)
goto out;
goto out_put_budget;
ret = BLK_STS_RESOURCE;
if (!get_device(&sdev->sdev_gendev))
goto out;
if (!scsi_dev_queue_ready(q, sdev))
goto out_put_device;
if (!scsi_target_queue_ready(shost, sdev))
goto out_dec_device_busy;
goto out_put_budget;
if (!scsi_host_queue_ready(q, shost, sdev))
goto out_dec_target_busy;
@ -1993,15 +2015,12 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
return BLK_STS_OK;
out_dec_host_busy:
atomic_dec(&shost->host_busy);
atomic_dec(&shost->host_busy);
out_dec_target_busy:
if (scsi_target(sdev)->can_queue > 0)
atomic_dec(&scsi_target(sdev)->target_busy);
out_dec_device_busy:
atomic_dec(&sdev->device_busy);
out_put_device:
put_device(&sdev->sdev_gendev);
out:
out_put_budget:
scsi_mq_put_budget(hctx);
switch (ret) {
case BLK_STS_OK:
break;
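
The `scsi_mq_get_budget()` / `scsi_mq_put_budget()` pair above follows a reserve-then-release pattern: a dispatch slot is taken before the request is issued, and every later failure path must hand it back. The sketch below models that pairing in userspace C11. It folds acquisition and dispatch into one function for brevity, whereas the diff registers them as separate `.get_budget` and `.queue_rq` callbacks. `struct dev_model`, `get_budget()`, `put_budget()`, and `dispatch()` are hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical device model: busy counts in-flight commands, depth is the cap. */
struct dev_model {
	atomic_int busy;
	int depth;
};

/* Reserve one slot; returns false and reserves nothing when the device is full. */
static bool get_budget(struct dev_model *dev)
{
	int busy = atomic_fetch_add(&dev->busy, 1) + 1;

	if (busy > dev->depth) {
		atomic_fetch_sub(&dev->busy, 1);
		return false;
	}
	return true;
}

/* Give back a slot previously reserved with get_budget(). */
static void put_budget(struct dev_model *dev)
{
	atomic_fetch_sub(&dev->busy, 1);
}

/* Take the budget first; any later failure returns it before bailing out. */
static bool dispatch(struct dev_model *dev, bool later_step_ok)
{
	if (!get_budget(dev))
		return false;
	if (!later_step_ok) {
		put_budget(dev);
		return false;
	}
	return true; /* budget stays held until completion calls put_budget() */
}
```

Collapsing the error labels into a single release call, as the `out_put_budget` label does in the diff, makes it harder to leak a slot on one of several exit paths.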
@ -2205,6 +2224,8 @@ struct request_queue *scsi_old_alloc_queue(struct scsi_device *sdev)
}
static const struct blk_mq_ops scsi_mq_ops = {
.get_budget = scsi_mq_get_budget,
.put_budget = scsi_mq_put_budget,
.queue_rq = scsi_queue_rq,
.complete = scsi_softirq_done,
.timeout = scsi_timeout,
@ -2919,21 +2940,37 @@ static void scsi_wait_for_queuecommand(struct scsi_device *sdev)
int
scsi_device_quiesce(struct scsi_device *sdev)
{
struct request_queue *q = sdev->request_queue;
int err;
/*
* It is allowed to call scsi_device_quiesce() multiple times from
* the same context but concurrent scsi_device_quiesce() calls are
* not allowed.
*/
WARN_ON_ONCE(sdev->quiesced_by && sdev->quiesced_by != current);
blk_set_preempt_only(q);
blk_mq_freeze_queue(q);
/*
* Ensure that the effect of blk_set_preempt_only() will be visible
* for percpu_ref_tryget() callers that occur after the queue
* unfreeze even if the queue was already frozen before this function
* was called. See also https://lwn.net/Articles/573497/.
*/
synchronize_rcu();
blk_mq_unfreeze_queue(q);
mutex_lock(&sdev->state_mutex);
err = scsi_device_set_state(sdev, SDEV_QUIESCE);
if (err == 0)
sdev->quiesced_by = current;
else
blk_clear_preempt_only(q);
mutex_unlock(&sdev->state_mutex);
if (err)
return err;
scsi_run_queue(sdev->request_queue);
while (atomic_read(&sdev->device_busy)) {
msleep_interruptible(200);
scsi_run_queue(sdev->request_queue);
}
return 0;
return err;
}
EXPORT_SYMBOL(scsi_device_quiesce);
@ -2953,9 +2990,11 @@ void scsi_device_resume(struct scsi_device *sdev)
* device deleted during suspend)
*/
mutex_lock(&sdev->state_mutex);
if (sdev->sdev_state == SDEV_QUIESCE &&
scsi_device_set_state(sdev, SDEV_RUNNING) == 0)
scsi_run_queue(sdev->request_queue);
WARN_ON_ONCE(!sdev->quiesced_by);
sdev->quiesced_by = NULL;
blk_clear_preempt_only(sdev->request_queue);
if (sdev->sdev_state == SDEV_QUIESCE)
scsi_device_set_state(sdev, SDEV_RUNNING);
mutex_unlock(&sdev->state_mutex);
}
EXPORT_SYMBOL(scsi_device_resume);


@ -217,7 +217,7 @@ static int sg_allow_access(struct file *filp, unsigned char *cmd)
if (sfp->parentdp->device->type == TYPE_SCANNER)
return 0;
return blk_verify_command(cmd, filp->f_mode & FMODE_WRITE);
return blk_verify_command(cmd, filp->f_mode);
}
static int


@ -54,18 +54,6 @@ struct block_device *I_BDEV(struct inode *inode)
}
EXPORT_SYMBOL(I_BDEV);
void __vfs_msg(struct super_block *sb, const char *prefix, const char *fmt, ...)
{
struct va_format vaf;
va_list args;
va_start(args, fmt);
vaf.fmt = fmt;
vaf.va = &args;
printk_ratelimited("%sVFS (%s): %pV\n", prefix, sb->s_id, &vaf);
va_end(args);
}
static void bdev_write_inode(struct block_device *bdev)
{
struct inode *inode = bdev->bd_inode;
@ -249,7 +237,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
if (!READ_ONCE(bio.bi_private))
break;
if (!(iocb->ki_flags & IOCB_HIPRI) ||
!blk_mq_poll(bdev_get_queue(bdev), qc))
!blk_poll(bdev_get_queue(bdev), qc))
io_schedule();
}
__set_current_state(TASK_RUNNING);
@ -414,7 +402,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
break;
if (!(iocb->ki_flags & IOCB_HIPRI) ||
!blk_mq_poll(bdev_get_queue(bdev), qc))
!blk_poll(bdev_get_queue(bdev), qc))
io_schedule();
}
__set_current_state(TASK_RUNNING);
@ -674,7 +662,7 @@ int bdev_read_page(struct block_device *bdev, sector_t sector,
if (!ops->rw_page || bdev_get_integrity(bdev))
return result;
result = blk_queue_enter(bdev->bd_queue, false);
result = blk_queue_enter(bdev->bd_queue, 0);
if (result)
return result;
result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, false);
@ -710,7 +698,7 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
if (!ops->rw_page || bdev_get_integrity(bdev))
return -EOPNOTSUPP;
result = blk_queue_enter(bdev->bd_queue, false);
result = blk_queue_enter(bdev->bd_queue, 0);
if (result)
return result;


@ -252,27 +252,6 @@ out:
return ret;
}
/*
* Kick the writeback threads then try to free up some ZONE_NORMAL memory.
*/
static void free_more_memory(void)
{
struct zoneref *z;
int nid;
wakeup_flusher_threads(1024, WB_REASON_FREE_MORE_MEM);
yield();
for_each_online_node(nid) {
z = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
gfp_zone(GFP_NOFS), NULL);
if (z->zone)
try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
GFP_NOFS, NULL);
}
}
/*
* I/O completion handler for block_read_full_page() - pages
* which come unlocked at the end of I/O.
@ -861,16 +840,19 @@ int remove_inode_buffers(struct inode *inode)
* which may not fail from ordinary buffer allocations.
*/
struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
int retry)
bool retry)
{
struct buffer_head *bh, *head;
gfp_t gfp = GFP_NOFS;
long offset;
try_again:
if (retry)
gfp |= __GFP_NOFAIL;
head = NULL;
offset = PAGE_SIZE;
while ((offset -= size) >= 0) {
bh = alloc_buffer_head(GFP_NOFS);
bh = alloc_buffer_head(gfp);
if (!bh)
goto no_grow;
@ -896,23 +878,7 @@ no_grow:
} while (head);
}
/*
* Return failure for non-async IO requests. Async IO requests
* are not allowed to fail, so we have to wait until buffer heads
* become available. But we don't want tasks sleeping with
* partially complete buffers, so all were released above.
*/
if (!retry)
return NULL;
/* We're _really_ low on memory. Now we just
* wait for old buffer heads to become free due to
* finishing IO. Since this is an async request and
* the reserve list is empty, we're sure there are
* async buffer heads in use.
*/
free_more_memory();
goto try_again;
return NULL;
}
EXPORT_SYMBOL_GPL(alloc_page_buffers);
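
With `free_more_memory()` gone, `alloc_page_buffers()` either succeeds for the whole page or returns NULL after releasing whatever it had already allocated. When `retry` is true it passes `__GFP_NOFAIL` rather than looping by hand. The userspace sketch below models only the all-or-nothing rollback path. `struct buf_model`, `alloc_chain()`, `free_chain()`, and the `limited_alloc()` failure-injection helper are hypothetical names, and `size` is assumed to be positive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a buffer head: one descriptor per block. */
struct buf_model {
	struct buf_model *next;
	long offset;
};

/* Hypothetical allocator hook so a failure can be injected. */
typedef void *(*alloc_fn)(size_t);

/* Helper: fails once allocs_left allocations have been served (-1: never). */
static int allocs_left = -1;
static void *limited_alloc(size_t n)
{
	if (allocs_left == 0)
		return NULL;
	if (allocs_left > 0)
		allocs_left--;
	return malloc(n);
}

static void free_chain(struct buf_model *head)
{
	while (head != NULL) {
		struct buf_model *next = head->next;
		free(head);
		head = next;
	}
}

/*
 * Build one descriptor per size-byte block of a page_size-byte page.
 * On any allocation failure the partial chain is released and NULL is
 * returned, so the caller never sees a half-built chain.
 */
static struct buf_model *alloc_chain(long page_size, long size, alloc_fn alloc)
{
	struct buf_model *head = NULL;
	long offset = page_size;

	while ((offset -= size) >= 0) {
		struct buf_model *b = alloc(sizeof(*b));

		if (b == NULL) {
			free_chain(head);
			return NULL;
		}
		b->next = head;
		b->offset = offset;
		head = b;
	}
	return head;
}
```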
@ -1001,8 +967,6 @@ grow_dev_page(struct block_device *bdev, sector_t block,
gfp_mask |= __GFP_NOFAIL;
page = find_or_create_page(inode->i_mapping, index, gfp_mask);
if (!page)
return ret;
BUG_ON(!PageLocked(page));
@ -1021,9 +985,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
/*
* Allocate some buffers for this page
*/
bh = alloc_page_buffers(page, size, 0);
if (!bh)
goto failed;
bh = alloc_page_buffers(page, size, true);
/*
* Link the page to the buffers and initialise them. Take the
@ -1103,8 +1065,6 @@ __getblk_slow(struct block_device *bdev, sector_t block,
ret = grow_buffers(bdev, block, size, gfp);
if (ret < 0)
return NULL;
if (ret == 0)
free_more_memory();
}
}
@ -1575,7 +1535,7 @@ void create_empty_buffers(struct page *page,
{
struct buffer_head *bh, *head, *tail;
head = alloc_page_buffers(page, blocksize, 1);
head = alloc_page_buffers(page, blocksize, true);
bh = head;
do {
bh->b_state |= b_state;
@ -2639,7 +2599,7 @@ int nobh_write_begin(struct address_space *mapping,
* Be careful: the buffer linked list is a NULL terminated one, rather
* than the circular one we're used to.
*/
head = alloc_page_buffers(page, blocksize, 0);
head = alloc_page_buffers(page, blocksize, false);
if (!head) {
ret = -ENOMEM;
goto out_release;
@ -3056,8 +3016,16 @@ void guard_bio_eod(int op, struct bio *bio)
sector_t maxsector;
struct bio_vec *bvec = &bio->bi_io_vec[bio->bi_vcnt - 1];
unsigned truncated_bytes;
struct hd_struct *part;
rcu_read_lock();
part = __disk_get_part(bio->bi_disk, bio->bi_partno);
if (part)
maxsector = part_nr_sects_read(part);
else
maxsector = get_capacity(bio->bi_disk);
rcu_read_unlock();
maxsector = get_capacity(bio->bi_disk);
if (!maxsector)
return;

Some files were not shown because too many files have changed in this diff.