License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                            # files
---------------------------------------------------|-------
GPL-2.0                                              11139
and resulted in the first patch in this series.
If that file was under a */uapi/* path, it was tagged "GPL-2.0 WITH
Linux-syscall-note"; otherwise it was tagged "GPL-2.0". The results were:
SPDX license identifier                            # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                        930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                              # files
-----------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                          270
GPL-2.0+ WITH Linux-syscall-note                         169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)       21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)       17
LGPL-2.1+ WITH Linux-syscall-note                         15
GPL-1.0+ WITH Linux-syscall-note                          14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)       5
LGPL-2.0+ WITH Linux-syscall-note                          4
LGPL-2.1 WITH Linux-syscall-note                           3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)                1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
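
The header/source distinction mentioned above can be sketched in C as follows. This is only an illustration: the helper name `spdx_line` and the extension-based rule are assumptions, not the script that Thomas and Greg actually used. The two comment forms follow the kernel convention of `//` for C source files and `/* */` for C headers.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format the SPDX tag line for a file, choosing the
 * comment style from the file extension. Returns 0 on success, -1 if the
 * extension is not handled by this sketch or the buffer is too small.
 */
static int spdx_line(const char *path, const char *license,
		     char *out, size_t outlen)
{
	const char *ext = strrchr(path, '.');
	int n;

	if (!ext)
		return -1;

	if (strcmp(ext, ".c") == 0)
		n = snprintf(out, outlen,
			     "// SPDX-License-Identifier: %s", license);
	else if (strcmp(ext, ".h") == 0)
		n = snprintf(out, outlen,
			     "/* SPDX-License-Identifier: %s */", license);
	else
		return -1;	/* other file types are out of scope here */

	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

For a path such as drivers/md/bcache/request.c this yields the same first line that the file carries below.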
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
// SPDX-License-Identifier: GPL-2.0
/*
 * Main bcache entry point - handle a read or a write request and decide what to
 * do with it; the make_request functions are called by the block layer.
 *
 * Copyright 2010, 2011 Kent Overstreet <kent.overstreet@gmail.com>
 * Copyright 2012 Google, Inc.
 */

#include "bcache.h"
#include "btree.h"
#include "debug.h"
#include "request.h"
#include "writeback.h"

#include <linux/module.h>
#include <linux/hash.h>
#include <linux/random.h>
#include <linux/backing-dev.h>

#include <trace/events/bcache.h>

#define CUTOFF_CACHE_ADD 95
#define CUTOFF_CACHE_READA 90

struct kmem_cache *bch_search_cache;

static void bch_data_insert_start(struct closure *cl);

static unsigned int cache_mode(struct cached_dev *dc)
{
	return BDEV_CACHE_MODE(&dc->sb);
}

static bool verify(struct cached_dev *dc)
{
	return dc->verify;
}

static void bio_csum(struct bio *bio, struct bkey *k)
{
	struct bio_vec bv;
	struct bvec_iter iter;
	uint64_t csum = 0;

	bio_for_each_segment(bv, bio, iter) {
		void *d = bvec_kmap_local(&bv);

		csum = crc64_be(csum, d, bv.bv_len);
		kunmap_local(d);
	}

	k->ptr[KEY_PTRS(k)] = csum & (~0ULL >> 1);
}
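
The last statement of bio_csum() masks the checksum before storing it. A minimal standalone sketch of that mask follows; `mask_csum` is a hypothetical name used only for illustration. Since `~0ULL >> 1` is 0x7fffffffffffffff, the top bit is cleared and the low 63 bits are kept.

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring the mask at the end of bio_csum():
 * ~0ULL >> 1 is 0x7fffffffffffffff, so only the low 63 bits survive.
 */
static uint64_t mask_csum(uint64_t csum)
{
	return csum & (~0ULL >> 1);
}
```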

/* Insert data into cache */

static void bch_data_insert_keys(struct closure *cl)
{
	struct data_insert_op *op = container_of(cl, struct data_insert_op, cl);
	atomic_t *journal_ref = NULL;
	struct bkey *replace_key = op->replace ? &op->replace_key : NULL;
	int ret;

	if (!op->replace)
		journal_ref = bch_journal(op->c, &op->insert_keys,
					  op->flush_journal ? cl : NULL);

	ret = bch_btree_insert(op->c, &op->insert_keys,
			       journal_ref, replace_key);
	if (ret == -ESRCH) {
		op->replace_collision = true;
	} else if (ret) {
		op->status = BLK_STS_RESOURCE;
		op->insert_data_done = true;
	}

	if (journal_ref)
		atomic_dec_bug(journal_ref);

	if (!op->insert_data_done) {
		continue_at(cl, bch_data_insert_start, op->wq);
		return;
	}

	bch_keylist_free(&op->insert_keys);
	closure_return(cl);
}

static int bch_keylist_realloc(struct keylist *l, unsigned int u64s,
			       struct cache_set *c)
{
	size_t oldsize = bch_keylist_nkeys(l);
	size_t newsize = oldsize + u64s;

	/*
	 * The journalling code doesn't handle the case where the keys to insert
	 * is bigger than an empty write: If we just return -ENOMEM here,
	 * bch_data_insert_keys() will insert the keys created so far
	 * and finish the rest when the keylist is empty.
	 */
	if (newsize * sizeof(uint64_t) > block_bytes(c->cache) - sizeof(struct jset))
		return -ENOMEM;

	return __bch_keylist_realloc(l, u64s);
}

static void bch_data_invalidate(struct closure *cl)
{
	struct data_insert_op *op = container_of(cl, struct data_insert_op, cl);
	struct bio *bio = op->bio;

	pr_debug("invalidating %i sectors from %llu\n",
		 bio_sectors(bio), (uint64_t) bio->bi_iter.bi_sector);

	while (bio_sectors(bio)) {
		unsigned int sectors = min(bio_sectors(bio),
					   1U << (KEY_SIZE_BITS - 1));

		if (bch_keylist_realloc(&op->insert_keys, 2, op->c))
			goto out;

		bio->bi_iter.bi_sector += sectors;
		bio->bi_iter.bi_size -= sectors << 9;

		bch_keylist_add(&op->insert_keys,
				&KEY(op->inode,
				     bio->bi_iter.bi_sector,
				     sectors));
	}

	op->insert_data_done = true;
	/* get in bch_data_insert() */
	bio_put(bio);
out:
	continue_at(cl, bch_data_insert_keys, op->wq);
}
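
The loop in bch_data_invalidate() splits the request into chunks of at most `1U << (KEY_SIZE_BITS - 1)` sectors and records one key per chunk, using the sector position after advancing as the key offset. The standalone sketch below imitates that arithmetic only. `split_invalidate`, `struct extent` and `SKETCH_KEY_SIZE_BITS` are hypothetical names, and the value 16 is an assumption about `KEY_SIZE_BITS`, not something shown in this file.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_KEY_SIZE_BITS 16	/* assumed value standing in for KEY_SIZE_BITS */

struct extent {
	uint64_t end;		/* sector position after advancing (the key offset) */
	unsigned int sectors;	/* chunk length in sectors */
};

/* Split [start, start + nr_sectors) into chunks; returns chunks written. */
static size_t split_invalidate(uint64_t start, unsigned int nr_sectors,
			       struct extent *out, size_t max_out)
{
	const unsigned int max_chunk = 1U << (SKETCH_KEY_SIZE_BITS - 1);
	size_t n = 0;

	while (nr_sectors && n < max_out) {
		unsigned int sectors =
			nr_sectors < max_chunk ? nr_sectors : max_chunk;

		/* Advance first, as the kernel loop does with bi_sector. */
		start += sectors;
		nr_sectors -= sectors;

		out[n].end = start;
		out[n].sectors = sectors;
		n++;
	}

	return n;
}
```

Under the assumed chunk limit of 32768 sectors, a 70000-sector request starting at sector 1000 would be covered by three keys.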

static void bch_data_insert_error(struct closure *cl)
{
	struct data_insert_op *op = container_of(cl, struct data_insert_op, cl);

	/*
	 * Our data write just errored, which means we've got a bunch of keys to
	 * insert that point to data that wasn't successfully written.
	 *
	 * We don't have to insert those keys but we still have to invalidate
	 * that region of the cache - so, if we just strip off all the pointers
	 * from the keys we'll accomplish just that.
	 */

	struct bkey *src = op->insert_keys.keys, *dst = op->insert_keys.keys;

	while (src != op->insert_keys.top) {
		struct bkey *n = bkey_next(src);

		SET_KEY_PTRS(src, 0);
		memmove(dst, src, bkey_bytes(src));

		dst = bkey_next(dst);
		src = n;
	}

	op->insert_keys.top = dst;

	bch_data_insert_keys(cl);
}
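
The in-place compaction done by bch_data_insert_error() can be illustrated with a simplified record layout. In the sketch below every name and the layout itself are assumptions made for illustration (two header words per record, pointer count in the low 3 bits of the first word); it is not the real bkey format. The pattern is the same as above: remember where the next record starts, zero the pointer count, slide the now shorter record down, then advance both cursors.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HDR_WORDS 2		/* hypothetical: two header words per record */
#define PTR_COUNT_MASK 0x7ULL	/* hypothetical: pointer count in low 3 bits */

/* Strip all pointers from packed records; returns the new length in words. */
static size_t strip_pointers(uint64_t *buf, size_t nwords)
{
	size_t src = 0, dst = 0;

	while (src + HDR_WORDS <= nwords) {
		/* Find the next record before the count is cleared. */
		size_t next = src + HDR_WORDS +
			      (size_t)(buf[src] & PTR_COUNT_MASK);

		if (next > nwords)
			break;	/* malformed input: stop rather than overrun */

		buf[src] &= ~PTR_COUNT_MASK;	/* like SET_KEY_PTRS(src, 0) */
		memmove(&buf[dst], &buf[src], HDR_WORDS * sizeof(buf[0]));

		dst += HDR_WORDS;	/* like dst = bkey_next(dst) */
		src = next;		/* like src = n */
	}

	return dst;
}
```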

static void bch_data_insert_endio(struct bio *bio)
{
	struct closure *cl = bio->bi_private;
	struct data_insert_op *op = container_of(cl, struct data_insert_op, cl);

	if (bio->bi_status) {
		/* TODO: We could try to recover from this. */
		if (op->writeback)
			op->status = bio->bi_status;
		else if (!op->replace)
			set_closure_fn(cl, bch_data_insert_error, op->wq);
		else
			set_closure_fn(cl, NULL, NULL);
	}

	bch_bbio_endio(op->c, bio, bio->bi_status, "writing data to cache");
}

static void bch_data_insert_start(struct closure *cl)
{
	struct data_insert_op *op = container_of(cl, struct data_insert_op, cl);
	struct bio *bio = op->bio, *n;

	if (op->bypass)
		return bch_data_invalidate(cl);

	if (atomic_sub_return(bio_sectors(bio), &op->c->sectors_to_gc) < 0)
		wake_up_gc(op->c);

	/*
	 * Journal writes are marked REQ_PREFLUSH; if the original write was a
	 * flush, it'll wait on the journal write.
	 */
	bio->bi_opf &= ~(REQ_PREFLUSH|REQ_FUA);

	do {
		unsigned int i;
		struct bkey *k;
		struct bio_set *split = &op->c->bio_split;

		/* 1 for the device pointer and 1 for the chksum */
		if (bch_keylist_realloc(&op->insert_keys,
					3 + (op->csum ? 1 : 0),
					op->c)) {
			continue_at(cl, bch_data_insert_keys, op->wq);
			return;
		}

		k = op->insert_keys.top;
		bkey_init(k);
		SET_KEY_INODE(k, op->inode);
		SET_KEY_OFFSET(k, bio->bi_iter.bi_sector);

		if (!bch_alloc_sectors(op->c, k, bio_sectors(bio),
				       op->write_point, op->write_prio,
				       op->writeback))
			goto err;

		n = bio_next_split(bio, KEY_SIZE(k), GFP_NOIO, split);

		n->bi_end_io = bch_data_insert_endio;
		n->bi_private = cl;

		if (op->writeback) {
			SET_KEY_DIRTY(k, true);

			for (i = 0; i < KEY_PTRS(k); i++)
				SET_GC_MARK(PTR_BUCKET(op->c, k, i),
					    GC_MARK_DIRTY);
		}

		SET_KEY_CSUM(k, op->csum);
		if (KEY_CSUM(k))
			bio_csum(n, k);

		trace_bcache_cache_insert(k);
		bch_keylist_push(&op->insert_keys);

		n->bi_opf = REQ_OP_WRITE;
		bch_submit_bbio(n, op->c, k, 0);
	} while (n != bio);

	op->insert_data_done = true;
	continue_at(cl, bch_data_insert_keys, op->wq);
	return;
err:
	/* bch_alloc_sectors() blocks if s->writeback = true */
	BUG_ON(op->writeback);

	/*
	 * But if it's not a writeback write we'd rather just bail out if
	 * there aren't any buckets ready to write to - it might take awhile and
	 * we might be starving btree writes for gc or something.
	 */

	if (!op->replace) {
		/*
		 * Writethrough write: We can't complete the write until we've
		 * updated the index. But we don't want to delay the write while
		 * we wait for buckets to be freed up, so just invalidate the
		 * rest of the write.
		 */
		op->bypass = true;
		return bch_data_invalidate(cl);
	} else {
		/*
		 * From a cache miss, we can just insert the keys for the data
		 * we have written or bail out if we didn't do anything.
		 */
		op->insert_data_done = true;
		bio_put(bio);

		if (!bch_keylist_empty(&op->insert_keys))
			continue_at(cl, bch_data_insert_keys, op->wq);
		else
			closure_return(cl);
	}
}

/**
 * bch_data_insert - stick some data in the cache
 * @cl: closure pointer.
 *
 * This is the starting point for any data to end up in a cache device; it could
 * be from a normal write, or a writeback write, or a write to a flash only
 * volume - it's also used by the moving garbage collector to compact data in
 * mostly empty buckets.
 *
 * It first writes the data to the cache, creating a list of keys to be inserted
 * (if the data had to be fragmented there will be multiple keys); after the
 * data is written it calls bch_journal, and after the keys have been added to
 * the next journal write they're inserted into the btree.
 *
 * It inserts the data in op->bio; bi_sector is used for the key offset,
 * and op->inode is used for the key inode.
 *
 * If op->bypass is true, instead of inserting the data it invalidates the
 * region of the cache represented by op->bio and op->inode.
 */
void bch_data_insert(struct closure *cl)
{
	struct data_insert_op *op = container_of(cl, struct data_insert_op, cl);

	trace_bcache_write(op->c, op->inode, op->bio,
			   op->writeback, op->bypass);

	bch_keylist_init(&op->insert_keys);
	bio_get(op->bio);
	bch_data_insert_start(cl);
}

bcache: Clean up bch_get_congested()
There are a few nits in this function. They could in theory all
be separate patches, but that's probably taking small commits
too far.
1) I added a brief comment saying what it does.
2) I like to declare pointer parameters "const" where possible
for documentation reasons.
3) It uses bitmap_weight(&rand, BITS_PER_LONG) to compute the Hamming
weight of a 32-bit random number (giving a random integer with
mean 16 and variance 8). Passing by reference in a 64-bit variable
is silly; just use hweight32().
4) Its helper function fract_exp_two is unnecessarily tangled.
Gcc can optimize the multiply by (1 << x) to a shift, but it can
be written in a much more straightforward way at the cost of one
more bit of internal precision. Some analysis reveals that this
bit is always available.
This shrinks the object code for fract_exp_two(x, 6) from 23 bytes:
0000000000000000 <foo1>:
0: 89 f9 mov %edi,%ecx
2: c1 e9 06 shr $0x6,%ecx
5: b8 01 00 00 00 mov $0x1,%eax
a: d3 e0 shl %cl,%eax
c: 83 e7 3f and $0x3f,%edi
f: d3 e7 shl %cl,%edi
11: c1 ef 06 shr $0x6,%edi
14: 01 f8 add %edi,%eax
16: c3 retq
To 19:
0000000000000017 <foo2>:
17: 89 f8 mov %edi,%eax
19: 83 e0 3f and $0x3f,%eax
1c: 83 c0 40 add $0x40,%eax
1f: 89 f9 mov %edi,%ecx
21: c1 e9 06 shr $0x6,%ecx
24: d3 e0 shl %cl,%eax
26: c1 e8 06 shr $0x6,%eax
29: c3 retq
(Verified with 0 <= frac_bits <= 8, 0 <= x < 16<<frac_bits;
both versions produce the same output.)
5) And finally, the call to bch_get_congested() in check_should_bypass()
is separated from the use of the value by multiple tests which
could moot the need to compute it. Move the computation down to
where it's needed. This also saves a local register to hold the
computed value.
Signed-off-by: George Spelvin <lkml@sdf.org>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
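
The two disassembly listings in point 4 correspond to two ways of computing the same fixed-point approximation of 2^x. The C below is a reconstruction from those listings, offered as a sketch rather than as the exact source text of either version. The `_old`/`_new` suffixes and `count_mismatches` are hypothetical names. The check loop covers the input range quoted in the message.

```c
/*
 * "Before" form (foo1): compute 2^int(x), then add a linearly
 * interpolated share of it for the fractional bits.
 */
static unsigned int fract_exp_two_old(unsigned int x, unsigned int fract_bits)
{
	unsigned int fract = x & ~(~0U << fract_bits);

	x >>= fract_bits;
	x = 1U << x;
	x += (x * fract) >> fract_bits;

	return x;
}

/*
 * "After" form (foo2): treat the fraction as a mantissa with an implicit
 * leading 1 bit, shift it by the integer exponent, then drop the
 * fractional bits again.
 */
static unsigned int fract_exp_two_new(unsigned int x, unsigned int fract_bits)
{
	unsigned int mantissa = 1U << fract_bits;	/* implicit bit */

	mantissa += x & (mantissa - 1);
	x >>= fract_bits;				/* integer exponent */

	return mantissa << x >> fract_bits;
}

/* Count inputs in the quoted range on which the two forms disagree. */
static unsigned int count_mismatches(void)
{
	unsigned int bad = 0, fb, x;

	for (fb = 0; fb <= 8; fb++)
		for (x = 0; x < (16U << fb); x++)
			if (fract_exp_two_old(x, fb) != fract_exp_two_new(x, fb))
				bad++;

	return bad;
}
```

The forms agree because `(1 << fract_bits) << e` is a multiple of `1 << fract_bits`, so the final right shift splits cleanly into `(1 << e)` plus `(fract << e) >> fract_bits`.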
2019-04-25 00:48:30 +08:00
/*
 * Congested? Return 0 (not congested) or the limit (in sectors)
 * beyond which we should bypass the cache due to congestion.
 */
unsigned int bch_get_congested(const struct cache_set *c)
{
	int i;

	if (!c->congested_read_threshold_us &&
	    !c->congested_write_threshold_us)
		return 0;

	i = (local_clock_us() - c->congested_last_us) / 1024;
	if (i < 0)
		return 0;

	i += atomic_read(&c->congested);
	if (i >= 0)
		return 0;

	i += CONGESTED_MAX;

	if (i > 0)
		i = fract_exp_two(i, 6);

	i -= hweight32(get_random_u32());

	return i > 0 ? i : 1;
}
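
The commit message for this function states that the Hamming weight of a 32-bit random number has mean 16 and variance 8. That follows from the weight being binomially distributed with n = 32 and p = 1/2. The sketch below checks it by exact counting, using the fact that C(32, k) of the 2^32 possible values have weight k. The names `hweight32_sketch`, `hamming_weight_stats` and `struct weight_stats` are hypothetical, and the popcount loop is a portable stand-in, not the kernel's hweight32().

```c
#include <stdint.h>

struct weight_stats {
	uint64_t mean;		/* E[w] */
	uint64_t variance;	/* E[w^2] - E[w]^2 */
};

/* Portable popcount: clear the lowest set bit until none remain. */
static unsigned int hweight32_sketch(uint32_t v)
{
	unsigned int n = 0;

	while (v) {
		v &= v - 1;
		n++;
	}

	return n;
}

/* Exact mean and variance of the weight over all 2^32 values. */
static struct weight_stats hamming_weight_stats(void)
{
	uint64_t c = 1;		/* C(32, 0) */
	uint64_t sum = 0, sum_sq = 0;
	struct weight_stats s;
	unsigned int k;

	for (k = 0; k <= 32; k++) {
		sum += k * c;
		sum_sq += (uint64_t)k * k * c;
		c = c * (32 - k) / (k + 1);	/* C(32, k + 1), exact */
	}

	/* Both sums are exact multiples of 2^32, so the shifts lose nothing. */
	s.mean = sum >> 32;
	s.variance = (sum_sq >> 32) - s.mean * s.mean;

	return s;
}
```

In bch_get_congested() this weight is subtracted from the computed limit, so the result is jittered by a value that clusters around 16.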

static void add_sequential(struct task_struct *t)
{
	ewma_add(t->sequential_io_avg,
		 t->sequential_io, 8, 0);

	t->sequential_io = 0;
}

static struct hlist_head *iohash(struct cached_dev *dc, uint64_t k)
{
	return &dc->io_hash[hash_64(k, RECENT_IO_BITS)];
}
|
|
|
|
|
|
|
|
static bool check_should_bypass(struct cached_dev *dc, struct bio *bio)
|
|
|
|
{
|
|
|
|
struct cache_set *c = dc->disk.c;
|
2018-08-11 13:19:44 +08:00
|
|
|
unsigned int mode = cache_mode(dc);
|
bcache: Clean up bch_get_congested()
There are a few nits in this function. They could in theory all
be separate patches, but that's probably taking small commits
too far.
1) I added a brief comment saying what it does.
2) I like to declare pointer parameters "const" where possible
for documentation reasons.
3) It uses bitmap_weight(&rand, BITS_PER_LONG) to compute the Hamming
weight of a 32-bit random number (giving a random integer with
mean 16 and variance 8). Passing by reference in a 64-bit variable
is silly; just use hweight32().
4) Its helper function fract_exp_two is unnecessarily tangled.
Gcc can optimize the multiply by (1 << x) to a shift, but it can
be written in a much more straightforward way at the cost of one
more bit of internal precision. Some analysis reveals that this
bit is always available.
This shrinks the object code for fract_exp_two(x, 6) from 23 bytes:
0000000000000000 <foo1>:
0: 89 f9 mov %edi,%ecx
2: c1 e9 06 shr $0x6,%ecx
5: b8 01 00 00 00 mov $0x1,%eax
a: d3 e0 shl %cl,%eax
c: 83 e7 3f and $0x3f,%edi
f: d3 e7 shl %cl,%edi
11: c1 ef 06 shr $0x6,%edi
14: 01 f8 add %edi,%eax
16: c3 retq
To 19:
0000000000000017 <foo2>:
17: 89 f8 mov %edi,%eax
19: 83 e0 3f and $0x3f,%eax
1c: 83 c0 40 add $0x40,%eax
1f: 89 f9 mov %edi,%ecx
21: c1 e9 06 shr $0x6,%ecx
24: d3 e0 shl %cl,%eax
26: c1 e8 06 shr $0x6,%eax
29: c3 retq
(Verified with 0 <= frac_bits <= 8, 0 <= x < 16<<frac_bits;
both versions produce the same output.)
5) And finally, the call to bch_get_congested() in check_should_bypass()
is separated from the use of the value by multiple tests which
could moot the need to compute it. Move the computation down to
where it's needed. This also saves a local register to hold the
computed value.
Signed-off-by: George Spelvin <lkml@sdf.org>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-25 00:48:30 +08:00
|
|
|
unsigned int sectors, congested;
|
2013-09-11 10:02:45 +08:00
|
|
|
struct task_struct *task = current;
|
2013-07-31 13:34:40 +08:00
|
|
|
struct io *i;
|
2013-09-11 10:02:45 +08:00
|
|
|
|
2013-08-22 08:49:09 +08:00
|
|
|
if (test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) ||
|
2013-09-11 10:02:45 +08:00
|
|
|
c->gc_stats.in_use > CUTOFF_CACHE_ADD ||
|
2016-06-06 03:32:05 +08:00
|
|
|
(bio_op(bio) == REQ_OP_DISCARD))
|
2013-09-11 10:02:45 +08:00
|
|
|
goto skip;
|
|
|
|
|
|
|
|
if (mode == CACHE_MODE_NONE ||
|
|
|
|
(mode == CACHE_MODE_WRITEAROUND &&
|
2016-06-06 03:31:47 +08:00
|
|
|
op_is_write(bio_op(bio))))
|
2013-09-11 10:02:45 +08:00
|
|
|
goto skip;
|
|
|
|
|
2017-10-14 07:35:33 +08:00
|
|
|
/*
|
2020-02-01 22:42:33 +08:00
|
|
|
* If the bio is for read-ahead or background IO, bypass it or
|
|
|
|
* not depends on the following situations,
|
|
|
|
* - If the IO is for meta data, always cache it and no bypass
|
|
|
|
* - If the IO is not meta data, check dc->cache_reada_policy,
|
|
|
|
* BCH_CACHE_READA_ALL: cache it and not bypass
|
|
|
|
* BCH_CACHE_READA_META_ONLY: not cache it and bypass
|
|
|
|
* That is, read-ahead request for metadata always get cached
|
2019-02-09 12:53:11 +08:00
|
|
|
* (eg, for gfs2 or xfs).
|
2017-10-14 07:35:33 +08:00
|
|
|
*/
|
2020-02-01 22:42:33 +08:00
|
|
|
if ((bio->bi_opf & (REQ_RAHEAD|REQ_BACKGROUND))) {
|
|
|
|
if (!(bio->bi_opf & (REQ_META|REQ_PRIO)) &&
|
|
|
|
(dc->cache_readahead_policy != BCH_CACHE_READA_ALL))
|
|
|
|
goto skip;
|
|
|
|
}
|
2017-10-14 07:35:33 +08:00
|
|
|
|
2020-10-01 14:50:56 +08:00
|
|
|
if (bio->bi_iter.bi_sector & (c->cache->sb.block_size - 1) ||
|
|
|
|
bio_sectors(bio) & (c->cache->sb.block_size - 1)) {
|
2020-05-27 12:01:52 +08:00
|
|
|
pr_debug("skipping unaligned io\n");
|
2013-09-11 10:02:45 +08:00
|
|
|
goto skip;
|
|
|
|
}
|
|
|
|
|
2013-09-11 05:27:42 +08:00
|
|
|
if (bypass_torture_test(dc)) {
|
2022-10-10 10:44:02 +08:00
|
|
|
if (get_random_u32_below(4) == 3)
|
2013-09-11 05:27:42 +08:00
|
|
|
goto skip;
|
|
|
|
else
|
|
|
|
goto rescale;
|
|
|
|
}
|
|
|
|
|
bcache: Clean up bch_get_congested()
There are a few nits in this function. They could in theory all
be separate patches, but that's probably taking small commits
too far.
1) I added a brief comment saying what it does.
2) I like to declare pointer parameters "const" where possible
for documentation reasons.
3) It uses bitmap_weight(&rand, BITS_PER_LONG) to compute the Hamming
weight of a 32-bit random number (giving a random integer with
mean 16 and variance 8). Passing by reference in a 64-bit variable
is silly; just use hweight32().
4) Its helper function fract_exp_two is unnecessarily tangled.
Gcc can optimize the multiply by (1 << x) to a shift, but it can
be written in a much more straightforward way at the cost of one
more bit of internal precision. Some analysis reveals that this
bit is always available.
This shrinks the object code for fract_exp_two(x, 6) from 23 bytes:
0000000000000000 <foo1>:
0: 89 f9 mov %edi,%ecx
2: c1 e9 06 shr $0x6,%ecx
5: b8 01 00 00 00 mov $0x1,%eax
a: d3 e0 shl %cl,%eax
c: 83 e7 3f and $0x3f,%edi
f: d3 e7 shl %cl,%edi
11: c1 ef 06 shr $0x6,%edi
14: 01 f8 add %edi,%eax
16: c3 retq
To 19:
0000000000000017 <foo2>:
17: 89 f8 mov %edi,%eax
19: 83 e0 3f and $0x3f,%eax
1c: 83 c0 40 add $0x40,%eax
1f: 89 f9 mov %edi,%ecx
21: c1 e9 06 shr $0x6,%ecx
24: d3 e0 shl %cl,%eax
26: c1 e8 06 shr $0x6,%eax
29: c3 retq
(Verified with 0 <= frac_bits <= 8, 0 <= x < 16<<frac_bits;
both versions produce the same output.)
5) And finally, the call to bch_get_congested() in check_should_bypass()
is separated from the use of the value by multiple tests which
could moot the need to compute it. Move the computation down to
where it's needed. This also saves a local register to hold the
computed value.
Signed-off-by: George Spelvin <lkml@sdf.org>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-25 00:48:30 +08:00
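Item 4 of the commit message above claims that the two forms of fract_exp_two() produce the same output over the stated range. That claim can be checked in user space with a small sketch. Both functions below are reconstructions from the description in the message (a "tangled" form and a "straightforward" form with an implicit mantissa bit), not copies of the kernel source; the names fract_exp_two_old, fract_exp_two_new and count_mismatches are hypothetical.

```c
/* Assumed shape of the original helper: computes roughly
 * 2^(x / 2^fract_bits), approximating the fractional part linearly. */
static unsigned int fract_exp_two_old(unsigned int x, unsigned int fract_bits)
{
	unsigned int fract = x & ~(~0u << fract_bits);

	x >>= fract_bits;		/* integer part of the exponent */
	x = 1u << x;			/* 2^integer_part */
	x += (x * fract) >> fract_bits;	/* add the linear fraction */
	return x;
}

/* Assumed shape of the simplified helper: build the mantissa first,
 * then apply the exponent with a single shift. */
static unsigned int fract_exp_two_new(unsigned int x, unsigned int fract_bits)
{
	unsigned int mantissa = 1u << fract_bits;	/* implicit leading bit */

	mantissa += x & (mantissa - 1);	/* fractional bits of x */
	x >>= fract_bits;		/* integer exponent */
	return mantissa << x >> fract_bits;
}

/* Exhaustively compare both versions over the range quoted in the
 * commit message: 0 <= fract_bits <= 8, 0 <= x < 16 << fract_bits. */
static unsigned long count_mismatches(void)
{
	unsigned long bad = 0;
	unsigned int fb, x;

	for (fb = 0; fb <= 8; fb++)
		for (x = 0; x < (16u << fb); x++)
			if (fract_exp_two_old(x, fb) != fract_exp_two_new(x, fb))
				bad++;
	return bad;
}
```

Within that range the largest intermediate value in the new form stays below 2^24, so the extra bit of internal precision mentioned in the message is available in a 32-bit unsigned int.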
|
|
|
congested = bch_get_congested(c);
|
2013-09-11 10:02:45 +08:00
|
|
|
if (!congested && !dc->sequential_cutoff)
|
|
|
|
goto rescale;
|
|
|
|
|
2013-07-31 13:34:40 +08:00
|
|
|
spin_lock(&dc->io_lock);
|
2013-09-11 10:02:45 +08:00
|
|
|
|
2013-10-12 06:44:27 +08:00
|
|
|
hlist_for_each_entry(i, iohash(dc, bio->bi_iter.bi_sector), hash)
|
|
|
|
if (i->last == bio->bi_iter.bi_sector &&
|
2013-07-31 13:34:40 +08:00
|
|
|
time_before(jiffies, i->jiffies))
|
|
|
|
goto found;
|
2013-09-11 10:02:45 +08:00
|
|
|
|
2013-07-31 13:34:40 +08:00
|
|
|
i = list_first_entry(&dc->io_lru, struct io, lru);
|
2013-09-11 10:02:45 +08:00
|
|
|
|
2013-07-31 13:34:40 +08:00
|
|
|
add_sequential(task);
|
|
|
|
i->sequential = 0;
|
2013-09-11 10:02:45 +08:00
|
|
|
found:
|
2013-10-12 06:44:27 +08:00
|
|
|
if (i->sequential + bio->bi_iter.bi_size > i->sequential)
|
|
|
|
i->sequential += bio->bi_iter.bi_size;
|
2013-09-11 10:02:45 +08:00
|
|
|
|
2013-07-31 13:34:40 +08:00
|
|
|
i->last = bio_end_sector(bio);
|
|
|
|
i->jiffies = jiffies + msecs_to_jiffies(5000);
|
|
|
|
task->sequential_io = i->sequential;
|
2013-09-11 10:02:45 +08:00
|
|
|
|
2013-07-31 13:34:40 +08:00
|
|
|
hlist_del(&i->hash);
|
|
|
|
hlist_add_head(&i->hash, iohash(dc, i->last));
|
|
|
|
list_move_tail(&i->lru, &dc->io_lru);
|
2013-09-11 10:02:45 +08:00
|
|
|
|
2013-07-31 13:34:40 +08:00
|
|
|
spin_unlock(&dc->io_lock);
|
2013-09-11 10:02:45 +08:00
|
|
|
|
|
|
|
sectors = max(task->sequential_io,
|
|
|
|
task->sequential_io_avg) >> 9;
|
|
|
|
|
|
|
|
if (dc->sequential_cutoff &&
|
|
|
|
sectors >= dc->sequential_cutoff >> 9) {
|
|
|
|
trace_bcache_bypass_sequential(bio);
|
|
|
|
goto skip;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (congested && sectors >= congested) {
|
|
|
|
trace_bcache_bypass_congested(bio);
|
|
|
|
goto skip;
|
|
|
|
}
|
|
|
|
|
|
|
|
rescale:
|
|
|
|
bch_rescale_priorities(c, bio_sectors(bio));
|
|
|
|
return false;
|
|
|
|
skip:
|
|
|
|
bch_mark_sectors_bypassed(c, dc, bio_sectors(bio));
|
|
|
|
return true;
|
|
|
|
}
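The tests in the function above that do not depend on kernel state can be modelled in user space. The sketch below covers three of them: the power-of-two alignment mask, the wrap-safe accumulation of sequential bytes, and the two threshold comparisons. All names (demo_io_aligned, demo_add_sequential, demo_over_threshold) are hypothetical, and the model takes a single sequential byte count where the real code takes the maximum of two per-task values.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when both the start and the length are multiples of block_size.
 * block_size is in sectors and is assumed to be a power of two, which
 * is what makes the (block_size - 1) mask valid. */
static bool demo_io_aligned(uint64_t start_sector, unsigned int nr_sectors,
			    unsigned int block_size)
{
	unsigned int mask = block_size - 1;

	return !((start_sector & mask) || (nr_sectors & mask));
}

/* Accumulate the byte count of a sequential stream without wrapping:
 * if the unsigned sum overflows it compares smaller than the old value
 * and the addition is skipped. */
static unsigned int demo_add_sequential(unsigned int sequential,
					unsigned int bytes)
{
	if (sequential + bytes > sequential)
		sequential += bytes;
	return sequential;
}

/* Bypass thresholds. A zero cutoff or a zero congestion threshold
 * disables the corresponding test. */
static bool demo_over_threshold(unsigned int sequential_bytes,
				unsigned int cutoff_bytes,
				unsigned int congested_sectors)
{
	unsigned int sectors = sequential_bytes >> 9;	/* 512-byte sectors */

	if (cutoff_bytes && sectors >= (cutoff_bytes >> 9))
		return true;
	if (congested_sectors && sectors >= congested_sectors)
		return true;
	return false;
}
```

For example, with a 4 MiB cutoff a stream that has accumulated exactly 4 MiB is bypassed, while a 1 MiB stream is cached unless a congestion threshold at or below 2048 sectors is in effect.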
|
|
|
|
|
2013-07-25 08:41:08 +08:00
|
|
|
/* Cache lookup */
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:02:45 +08:00
|
|
|
struct search {
|
|
|
|
/* Stack frame for bio_complete */
|
|
|
|
struct closure cl;
|
|
|
|
|
|
|
|
struct bbio bio;
|
|
|
|
struct bio *orig_bio;
|
|
|
|
struct bio *cache_miss;
|
2013-09-11 10:16:31 +08:00
|
|
|
struct bcache_device *d;
|
2013-09-11 10:02:45 +08:00
|
|
|
|
2018-08-11 13:19:44 +08:00
|
|
|
unsigned int insert_bio_sectors;
|
|
|
|
unsigned int recoverable:1;
|
|
|
|
unsigned int write:1;
|
|
|
|
unsigned int read_dirty_data:1;
|
|
|
|
unsigned int cache_missed:1;
|
2013-09-11 10:02:45 +08:00
|
|
|
|
2021-01-24 18:02:37 +08:00
|
|
|
struct block_device *orig_bdev;
|
2013-09-11 10:02:45 +08:00
|
|
|
unsigned long start_time;
|
|
|
|
|
|
|
|
struct btree_op op;
|
|
|
|
struct data_insert_op iop;
|
|
|
|
};
|
|
|
|
|
2015-07-20 21:29:37 +08:00
|
|
|
static void bch_cache_read_endio(struct bio *bio)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
struct bbio *b = container_of(bio, struct bbio, bio);
|
|
|
|
struct closure *cl = bio->bi_private;
|
|
|
|
struct search *s = container_of(cl, struct search, cl);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the bucket was reused while our bio was in flight, we might have
|
|
|
|
* read the wrong data. Set s->iop.status but not the bio's own status so it doesn't get
|
|
|
|
* counted against the cache device, but we'll still reread the data
|
|
|
|
* from the backing device.
|
|
|
|
*/
|
|
|
|
|
2017-06-03 15:38:06 +08:00
|
|
|
if (bio->bi_status)
|
|
|
|
s->iop.status = bio->bi_status;
|
2013-08-10 12:14:13 +08:00
|
|
|
else if (!KEY_DIRTY(&b->key) &&
|
|
|
|
ptr_stale(s->iop.c, &b->key, 0)) {
|
2013-09-11 10:02:45 +08:00
|
|
|
atomic_long_inc(&s->iop.c->cache_read_races);
|
2017-06-03 15:38:06 +08:00
|
|
|
s->iop.status = BLK_STS_IOERR;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2017-06-03 15:38:06 +08:00
|
|
|
bch_bbio_endio(s->iop.c, bio, bio->bi_status, "reading from cache");
|
2013-03-24 07:11:31 +08:00
|
|
|
}
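The completion handler above treats a clean, stale pointer as a read error so that the data is reread from the backing device. A minimal model of that decision is sketched below. It assumes staleness is detected with an 8-bit generation counter compared modulo 256; that comparison, and every demo_ name, is an assumption for illustration rather than something shown in the code above.

```c
#include <stdbool.h>
#include <stdint.h>

enum demo_status { DEMO_OK = 0, DEMO_IOERR = 1 };

struct demo_bucket {
	uint8_t gen;			/* bumped each time the bucket is reused */
};

struct demo_ptr {
	const struct demo_bucket *bucket;
	uint8_t gen;			/* bucket generation when the key was written */
	bool dirty;
};

/* Stale when the bucket generation has moved ahead of the pointer's.
 * The difference is taken modulo 256 so the counter may wrap. */
static bool demo_ptr_stale(const struct demo_ptr *p)
{
	uint8_t ahead = (uint8_t)(p->bucket->gen - p->gen);

	return ahead != 0 && ahead <= 128;
}

/* Mirrors the handler's branches: a device error is passed through,
 * otherwise a clean pointer that went stale in flight becomes an error. */
static enum demo_status demo_read_endio(enum demo_status bio_status,
					const struct demo_ptr *p)
{
	if (bio_status != DEMO_OK)
		return bio_status;
	if (!p->dirty && demo_ptr_stale(p))
		return DEMO_IOERR;
	return DEMO_OK;
}
```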
|
|
|
|
|
2013-07-25 08:41:08 +08:00
|
|
|
/*
|
|
|
|
* Read from a single key, handling the initial cache miss if the key starts in
|
|
|
|
* the middle of the bio
|
|
|
|
*/
|
2013-07-25 08:41:13 +08:00
|
|
|
static int cache_lookup_fn(struct btree_op *op, struct btree *b, struct bkey *k)
|
2013-07-25 08:41:08 +08:00
|
|
|
{
|
|
|
|
struct search *s = container_of(op, struct search, op);
|
2013-07-25 08:41:13 +08:00
|
|
|
struct bio *n, *bio = &s->bio.bio;
|
|
|
|
struct bkey *bio_key;
|
2018-08-11 13:19:44 +08:00
|
|
|
unsigned int ptr;
|
2013-07-25 08:41:08 +08:00
|
|
|
|
2013-10-12 06:44:27 +08:00
|
|
|
if (bkey_cmp(k, &KEY(s->iop.inode, bio->bi_iter.bi_sector, 0)) <= 0)
|
2013-07-25 08:41:13 +08:00
|
|
|
return MAP_CONTINUE;
|
|
|
|
|
2013-09-11 10:02:45 +08:00
|
|
|
if (KEY_INODE(k) != s->iop.inode ||
|
2013-10-12 06:44:27 +08:00
|
|
|
KEY_START(k) > bio->bi_iter.bi_sector) {
|
2018-08-11 13:19:44 +08:00
|
|
|
unsigned int bio_sectors = bio_sectors(bio);
|
|
|
|
unsigned int sectors = KEY_INODE(k) == s->iop.inode
|
2013-07-25 08:41:13 +08:00
|
|
|
? min_t(uint64_t, INT_MAX,
|
2013-10-12 06:44:27 +08:00
|
|
|
KEY_START(k) - bio->bi_iter.bi_sector)
|
2013-07-25 08:41:13 +08:00
|
|
|
: INT_MAX;
|
|
|
|
int ret = s->d->cache_miss(b, s, bio, sectors);
|
2018-08-11 13:19:45 +08:00
|
|
|
|
2013-07-25 08:41:13 +08:00
|
|
|
if (ret != MAP_CONTINUE)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
/* if this was a complete miss we shouldn't get here */
|
|
|
|
BUG_ON(bio_sectors <= sectors);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!KEY_SIZE(k))
|
|
|
|
return MAP_CONTINUE;
|
2013-07-25 08:41:08 +08:00
|
|
|
|
|
|
|
/* XXX: figure out best pointer - for multiple cache devices */
|
|
|
|
ptr = 0;
|
|
|
|
|
|
|
|
PTR_BUCKET(b->c, k, ptr)->prio = INITIAL_PRIO;
|
|
|
|
|
2013-09-11 05:27:42 +08:00
|
|
|
if (KEY_DIRTY(k))
|
|
|
|
s->read_dirty_data = true;
|
|
|
|
|
2013-11-24 10:21:01 +08:00
|
|
|
n = bio_next_split(bio, min_t(uint64_t, INT_MAX,
|
|
|
|
KEY_OFFSET(k) - bio->bi_iter.bi_sector),
|
2018-05-21 06:25:51 +08:00
|
|
|
GFP_NOIO, &s->d->bio_split);
|
2013-07-25 08:41:08 +08:00
|
|
|
|
2013-07-25 08:41:13 +08:00
|
|
|
bio_key = &container_of(n, struct bbio, bio)->key;
|
|
|
|
bch_bkey_copy_single_ptr(bio_key, k, ptr);
|
2013-07-25 08:41:08 +08:00
|
|
|
|
2013-10-12 06:44:27 +08:00
|
|
|
bch_cut_front(&KEY(s->iop.inode, n->bi_iter.bi_sector, 0), bio_key);
|
2013-09-11 10:02:45 +08:00
|
|
|
bch_cut_back(&KEY(s->iop.inode, bio_end_sector(n), 0), bio_key);
|
2013-07-25 08:41:08 +08:00
|
|
|
|
2013-07-25 08:41:13 +08:00
|
|
|
n->bi_end_io = bch_cache_read_endio;
|
|
|
|
n->bi_private = &s->cl;
|
2013-07-25 08:41:08 +08:00
|
|
|
|
2013-07-25 08:41:13 +08:00
|
|
|
/*
|
|
|
|
* The bucket we're reading from might be reused while our bio
|
|
|
|
* is in flight, and we could then end up reading the wrong
|
|
|
|
* data.
|
|
|
|
*
|
|
|
|
* We guard against this by checking (in bch_cache_read_endio()) if
|
|
|
|
* the pointer is stale again; if so, we treat it as an error
|
|
|
|
* and reread from the backing device (but we don't pass that
|
|
|
|
* error up anywhere).
|
|
|
|
*/
|
2013-07-25 08:41:08 +08:00
|
|
|
|
2013-07-25 08:41:13 +08:00
|
|
|
__bch_submit_bbio(n, b->c);
|
|
|
|
return n == bio ? MAP_DONE : MAP_CONTINUE;
|
2013-07-25 08:41:08 +08:00
|
|
|
}
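The arithmetic in cache_lookup_fn() splits a read into a leading miss (the part before the key starts) and a hit (the part the key covers). The sketch below models that split with plain sector ranges. It is a simplification under stated assumptions: the key is non-empty and ends after the read starts, as the early returns in the function guarantee, and the demo_ names are hypothetical.

```c
#include <stdint.h>

struct demo_split {
	uint64_t miss;	/* sectors before the key, served by the miss path */
	uint64_t hit;	/* sectors covered by the key, read from the cache */
};

static uint64_t demo_min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* Read covers [bio_start, bio_start + bio_len), key covers
 * [key_start, key_end). Preconditions: key_end > key_start and
 * key_end > bio_start. */
static struct demo_split demo_split_read(uint64_t bio_start, uint64_t bio_len,
					 uint64_t key_start, uint64_t key_end)
{
	struct demo_split s = { 0, 0 };
	uint64_t pos, remaining;

	/* The key starts in the middle of (or after) the read: miss first. */
	if (key_start > bio_start)
		s.miss = demo_min_u64(key_start - bio_start, bio_len);

	pos = bio_start + s.miss;
	remaining = bio_len - s.miss;

	/* Whatever is left is a hit up to the end of the key. */
	if (remaining)
		s.hit = demo_min_u64(key_end - pos, remaining);
	return s;
}
```

A read of sectors 100 to 149 against a key covering 120 to 139 therefore yields a 20-sector miss followed by a 20-sector hit; the remaining 10 sectors are left for the next key in the walk.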
|
|
|
|
|
|
|
|
static void cache_lookup(struct closure *cl)
|
|
|
|
{
|
2013-09-11 10:02:45 +08:00
|
|
|
struct search *s = container_of(cl, struct search, iop.cl);
|
2013-07-25 08:41:08 +08:00
|
|
|
struct bio *bio = &s->bio.bio;
|
bcache: ret IOERR when read meets metadata error
The read request might meet error when searching the btree, but the error
was not handled in cache_lookup(), and this kind of metadata failure will
not go into cached_dev_read_error(), finally, the upper layer will receive
bi_status=0. In this patch we judge the metadata error by the return
value of bch_btree_map_keys(), there are two potential paths give rise to
the error:
1. Because the btree is not entirely cached in memory, we may get an error
when reading a btree node from the cache device (see bch_btree_node_get()); the
likely errno is -EIO or -ENOMEM
2. When read miss happens, bch_btree_insert_check_key() will be called to
insert a "replace_key" to btree(see cached_dev_cache_miss(), just for
doing preparatory work before insert the missed data to cache device),
a failure can also happen in this situation, the likely errno is
-ENOMEM
bch_btree_map_keys() will return MAP_DONE in normal scenario, but we will
get either -EIO or -ENOMEM in above two cases. if this happened, we should
NOT recover data from backing device (when cache device is dirty) because
we don't know whether bkeys the read request covered are all clean. And
after that happened, s->iop.status is still its initial value (0) before
we submit s->bio.bio, we set it to BLK_STS_IOERR, so it can go into
cached_dev_read_error(), and finally it can be passed to upper layer, or
recovered by reread from backing device.
[edit by mlyle: patch formatting, word-wrap, comment spelling,
commit log format]
Signed-off-by: Hua Rui <huarui.dev@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09 04:21:18 +08:00
|
|
|
struct cached_dev *dc;
|
2013-09-11 10:16:31 +08:00
|
|
|
int ret;
|
2013-07-25 08:41:08 +08:00
|
|
|
|
2013-09-11 10:16:31 +08:00
|
|
|
bch_btree_op_init(&s->op, -1);
|
2013-07-25 08:41:08 +08:00
|
|
|
|
2013-09-11 10:16:31 +08:00
|
|
|
ret = bch_btree_map_keys(&s->op, s->iop.c,
|
|
|
|
&KEY(s->iop.inode, bio->bi_iter.bi_sector, 0),
|
|
|
|
cache_lookup_fn, MAP_END_KEY);
|
2015-03-06 23:37:46 +08:00
|
|
|
if (ret == -EAGAIN) {
|
2013-07-25 08:41:08 +08:00
|
|
|
continue_at(cl, cache_lookup, bcache_wq);
|
2015-03-06 23:37:46 +08:00
|
|
|
return;
|
|
|
|
}
|
2013-07-25 08:41:08 +08:00
|
|
|
|
2018-01-09 04:21:18 +08:00
|
|
|
/*
|
|
|
|
* We might hit an error when searching the btree. If that happens, we will
|
|
|
|
* get a negative ret; in this scenario we should not recover data from
|
|
|
|
* backing device (when cache device is dirty) because we don't know
|
|
|
|
* whether bkeys the read request covered are all clean.
|
|
|
|
*
|
|
|
|
* And after that happened, s->iop.status is still its initial value
|
|
|
|
* before we submit s->bio.bio
|
|
|
|
*/
|
|
|
|
if (ret < 0) {
|
|
|
|
BUG_ON(ret == -EINTR);
|
|
|
|
if (s->d && s->d->c &&
|
|
|
|
!UUID_FLASH_ONLY(&s->d->c->uuids[s->d->id])) {
|
|
|
|
dc = container_of(s->d, struct cached_dev, disk);
|
|
|
|
if (dc && atomic_read(&dc->has_dirty))
|
|
|
|
s->recoverable = false;
|
|
|
|
}
|
|
|
|
if (!s->iop.status)
|
|
|
|
s->iop.status = BLK_STS_IOERR;
|
|
|
|
}
|
|
|
|
|
2013-07-25 08:41:08 +08:00
|
|
|
closure_return(cl);
|
|
|
|
}
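The error handling at the end of cache_lookup() reduces to two rules: a negative return value must surface as an I/O error, and it must not be "recovered" from the backing device when that device may be out of date. The sketch below models only those two rules; the struct, the enum values and the function name are hypothetical stand-ins for the kernel types.

```c
#include <stdbool.h>

enum demo_sts { DEMO_STS_OK = 0, DEMO_STS_IOERR = 1 };

struct demo_search {
	enum demo_sts status;	/* what the upper layer will eventually see */
	bool recoverable;	/* may the read be retried from backing? */
};

/* ret is the result of the btree walk; negative means a metadata error.
 * is_cached_dev is false for flash-only volumes, which have no backing
 * device to fall back to. */
static void demo_finish_lookup(struct demo_search *s, int ret,
			       bool is_cached_dev, bool has_dirty)
{
	if (ret >= 0)
		return;

	/* With dirty data in the cache we cannot know whether the keys
	 * covered by the read were all clean, so forbid the retry. */
	if (is_cached_dev && has_dirty)
		s->recoverable = false;

	/* Do not overwrite an error that was already recorded. */
	if (s->status == DEMO_STS_OK)
		s->status = DEMO_STS_IOERR;
}
```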
|
|
|
|
|
|
|
|
/* Common code for the make_request functions */
|
|
|
|
|
2015-07-20 21:29:37 +08:00
|
|
|
static void request_endio(struct bio *bio)
|
2013-07-25 08:41:08 +08:00
|
|
|
{
|
|
|
|
struct closure *cl = bio->bi_private;
|
|
|
|
|
2017-06-03 15:38:06 +08:00
|
|
|
if (bio->bi_status) {
|
2013-07-25 08:41:08 +08:00
|
|
|
struct search *s = container_of(cl, struct search, cl);
|
2018-08-11 13:19:45 +08:00
|
|
|
|
2017-06-03 15:38:06 +08:00
|
|
|
s->iop.status = bio->bi_status;
|
2013-07-25 08:41:08 +08:00
|
|
|
/* Only cache read errors are recoverable */
|
|
|
|
s->recoverable = false;
|
|
|
|
}
|
|
|
|
|
|
|
|
bio_put(bio);
|
|
|
|
closure_put(cl);
|
|
|
|
}
|
|
|
|
|
2018-03-19 08:36:24 +08:00
|
|
|
static void backing_request_endio(struct bio *bio)
|
|
|
|
{
|
|
|
|
struct closure *cl = bio->bi_private;
|
|
|
|
|
|
|
|
if (bio->bi_status) {
|
|
|
|
struct search *s = container_of(cl, struct search, cl);
|
bcache: add io_disable to struct cached_dev
If a bcache device is configured to writeback mode, current code does not
handle write I/O errors on backing devices properly.
In writeback mode, a write request is written to the cache device and
later flushed to the backing device. If I/O fails when writing from the
cache device to the backing device, bcache code just ignores the error and
upper layer code is NOT notified that the backing device is broken.
This patch tries to handle backing device failure like how the cache device
failure is handled,
- Add a error counter 'io_errors' and error limit 'error_limit' in struct
cached_dev. Add another io_disable to struct cached_dev to disable I/Os
on the problematic backing device.
- When I/O error happens on backing device, increase io_errors counter. And
if io_errors reaches error_limit, set cache_dev->io_disable to true, and
stop the bcache device.
The result is, if the backing device is broken or disconnected, and I/O errors
reach its error limit, backing device will be disabled and the associated
bcache device will be removed from system.
Changelog:
v2: remove "bcache: " prefix in pr_error(), and use correct name string to
print out bcache device gendisk name.
v1: indeed this is new added in v2 patch set.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Cc: Michael Lyle <mlyle@lyle.org>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-19 08:36:25 +08:00
|
|
|
struct cached_dev *dc = container_of(s->d,
|
|
|
|
struct cached_dev, disk);
|
2018-03-19 08:36:24 +08:00
|
|
|
/*
|
|
|
|
* If a bio has REQ_PREFLUSH for writeback mode, it is
|
|
|
|
* specially assembled in cached_dev_write() for a non-zero
|
|
|
|
* write request which has REQ_PREFLUSH. We don't set
|
|
|
|
* s->iop.status by this failure, the status will be decided
|
|
|
|
* by result of bch_data_insert() operation.
|
|
|
|
*/
|
|
|
|
if (unlikely(s->iop.writeback &&
|
|
|
|
bio->bi_opf & REQ_PREFLUSH)) {
|
2021-10-20 22:38:10 +08:00
|
|
|
pr_err("Can't flush %pg: returned bi_status %i\n",
|
|
|
|
dc->bdev, bio->bi_status);
|
2018-03-19 08:36:24 +08:00
|
|
|
} else {
|
|
|
|
/* set to orig_bio->bi_status in bio_complete() */
|
|
|
|
s->iop.status = bio->bi_status;
|
|
|
|
}
|
|
|
|
s->recoverable = false;
|
|
|
|
/* should count I/O error for backing device here */
|
2018-03-19 08:36:25 +08:00
|
|
|
bch_count_backing_io_errors(dc, bio);
|
2018-03-19 08:36:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
bio_put(bio);
|
|
|
|
closure_put(cl);
|
|
|
|
}
|
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
static void bio_complete(struct search *s)
|
|
|
|
{
|
|
|
|
if (s->orig_bio) {
|
2020-07-28 21:59:20 +08:00
|
|
|
/* Count on bcache device */
|
2021-01-24 18:02:37 +08:00
|
|
|
bio_end_io_acct_remapped(s->orig_bio, s->start_time,
|
|
|
|
s->orig_bdev);
|
2013-09-11 10:02:45 +08:00
|
|
|
trace_bcache_request_end(s->d, s->orig_bio);
|
2017-06-03 15:38:06 +08:00
|
|
|
s->orig_bio->bi_status = s->iop.status;
|
2015-07-20 21:29:37 +08:00
|
|
|
bio_endio(s->orig_bio);
|
2013-03-24 07:11:31 +08:00
|
|
|
s->orig_bio = NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-03-19 08:36:24 +08:00
|
|
|
static void do_bio_hook(struct search *s,
|
|
|
|
struct bio *orig_bio,
|
|
|
|
bio_end_io_t *end_io_fn)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
struct bio *bio = &s->bio.bio;
|
|
|
|
|
2022-04-20 00:04:25 +08:00
|
|
|
bio_init_clone(orig_bio->bi_bdev, bio, orig_bio, GFP_NOIO);
|
2018-03-19 08:36:24 +08:00
|
|
|
/*
|
|
|
|
* bi_end_io can be set separately somewhere else, e.g. the
|
|
|
|
* variants in,
|
|
|
|
* - cache_bio->bi_end_io from cached_dev_cache_miss()
|
|
|
|
* - n->bi_end_io from cache_lookup_fn()
|
|
|
|
*/
|
|
|
|
bio->bi_end_io = end_io_fn;
|
2013-03-24 07:11:31 +08:00
|
|
|
bio->bi_private = &s->cl;
|
2013-11-23 11:37:48 +08:00
|
|
|
|
2015-04-18 06:23:59 +08:00
|
|
|
bio_cnt_set(bio, 3);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void search_free(struct closure *cl)
|
|
|
|
{
|
|
|
|
struct search *s = container_of(cl, struct search, cl);
|
|
|
|
|
2019-04-25 00:48:26 +08:00
|
|
|
atomic_dec(&s->iop.c->search_inflight);
|
bcache: finish incremental GC
In the GC thread, we record the latest GC key in gc_done, which is expected
to be used for incremental GC, but the current code does not implement
it. When GC runs, front side I/O is blocked until GC is over, which
can take a long time if there are a lot of btree nodes.
This patch implements incremental GC. The main idea is that, when there
are front side I/Os, after GC has processed some nodes (100), we stop GC,
release the lock of the btree node, process the front side I/Os for some
time (100 ms), then go back to GC again.
By this patch, when we doing GC, I/Os are not blocked all the time, and
there is no obvious I/Os zero jump problem any more.
Patch v2: Rename some variables and macros name as Coly suggested.
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-26 12:17:34 +08:00
|
|
|
|
2013-09-11 10:02:45 +08:00
|
|
|
if (s->iop.bio)
|
|
|
|
bio_put(s->iop.bio);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2018-02-28 01:49:30 +08:00
|
|
|
bio_complete(s);
|
2013-03-24 07:11:31 +08:00
|
|
|
closure_debug_destroy(cl);
|
2019-04-25 00:48:26 +08:00
|
|
|
mempool_free(s, &s->iop.c->search);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2013-09-11 10:16:31 +08:00
|
|
|
static inline struct search *search_alloc(struct bio *bio,
|
2021-01-24 18:02:37 +08:00
|
|
|
struct bcache_device *d, struct block_device *orig_bdev,
|
|
|
|
unsigned long start_time)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2013-07-25 08:26:51 +08:00
|
|
|
struct search *s;
|
|
|
|
|
2018-05-21 06:25:51 +08:00
|
|
|
s = mempool_alloc(&d->c->search, GFP_NOIO);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:16:31 +08:00
|
|
|
closure_init(&s->cl, NULL);
|
2018-03-19 08:36:24 +08:00
|
|
|
do_bio_hook(s, bio, request_endio);
|
2018-07-26 12:17:34 +08:00
|
|
|
atomic_inc(&d->c->search_inflight);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
s->orig_bio = bio;
|
2013-09-11 10:16:31 +08:00
|
|
|
s->cache_miss = NULL;
|
2017-10-31 05:46:34 +08:00
|
|
|
s->cache_missed = 0;
|
2013-09-11 10:16:31 +08:00
|
|
|
s->d = d;
|
2013-03-24 07:11:31 +08:00
|
|
|
s->recoverable = 1;
|
2016-06-06 03:31:47 +08:00
|
|
|
s->write = op_is_write(bio_op(bio));
|
2013-09-11 10:16:31 +08:00
|
|
|
s->read_dirty_data = 0;
|
2020-07-28 21:59:20 +08:00
|
|
|
/* Count on the bcache device */
|
2021-01-24 18:02:37 +08:00
|
|
|
s->orig_bdev = orig_bdev;
|
|
|
|
s->start_time = start_time;
|
2013-09-11 10:16:31 +08:00
|
|
|
s->iop.c = d->c;
|
|
|
|
s->iop.bio = NULL;
|
|
|
|
s->iop.inode = d->id;
|
|
|
|
s->iop.write_point = hash_long((unsigned long) current, 16);
|
|
|
|
s->iop.write_prio = 0;
|
2017-06-03 15:38:06 +08:00
|
|
|
s->iop.status = 0;
|
2013-09-11 10:16:31 +08:00
|
|
|
s->iop.flags = 0;
|
2017-01-27 23:30:47 +08:00
|
|
|
s->iop.flush_journal = op_is_flush(bio->bi_opf);
|
2014-01-10 08:03:04 +08:00
|
|
|
s->iop.wq = bcache_wq;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
return s;
|
|
|
|
}
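search_alloc() increments search_inflight and search_free() decrements it, so the counter always equals the number of live search structures. The incremental GC commit message quoted above says GC yields when there are front side I/Os; presumably this counter is how that is detected, though the GC side is not shown here. The sketch below models the pairing with C11 atomics and malloc() in place of the kernel's atomic_t and mempool; all demo_ names are hypothetical.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct demo_cache_set {
	atomic_int search_inflight;	/* number of live searches */
};

struct demo_search {
	struct demo_cache_set *c;
};

/* Allocate a search and account for it. Returns NULL on allocation
 * failure, in which case the counter is left untouched. */
static struct demo_search *demo_search_alloc(struct demo_cache_set *c)
{
	struct demo_search *s = malloc(sizeof(*s));

	if (!s)
		return NULL;
	s->c = c;
	atomic_fetch_add(&c->search_inflight, 1);
	return s;
}

/* Drop the accounting before releasing the memory, since the counter
 * is reached through the search itself. */
static void demo_search_free(struct demo_search *s)
{
	atomic_fetch_sub(&s->c->search_inflight, 1);
	free(s);
}

/* What a background task could poll to decide whether to yield. */
static int demo_front_io_pending(struct demo_cache_set *c)
{
	return atomic_load(&c->search_inflight) > 0;
}
```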
|
|
|
|
|
|
|
|
/* Cached devices */
|
|
|
|
|
|
|
|
static void cached_dev_bio_complete(struct closure *cl)
|
|
|
|
{
|
|
|
|
struct search *s = container_of(cl, struct search, cl);
|
|
|
|
struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
|
|
|
|
|
|
|
|
cached_dev_put(dc);
|
2019-04-25 00:48:26 +08:00
|
|
|
search_free(cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
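cached_dev_bio_complete() uses container_of() twice: once to get from the embedded closure to its enclosing search, and once to get from the bcache_device pointer to the cached_dev that embeds it. The sketch below shows the same pointer arithmetic in portable user-space C. The macro is a simplified version built on offsetof(), and the struct layouts are hypothetical reductions of the kernel types.

```c
#include <stddef.h>

/* Simplified container_of: step back from a member to its container. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_closure {
	int remaining;
};

struct demo_device {
	int id;
};

struct demo_cached_dev {
	int has_dirty;
	struct demo_device disk;	/* embedded, not a pointer */
};

struct demo_search {
	struct demo_closure cl;		/* embedded closure */
	struct demo_device *d;		/* points at some cached_dev's disk */
};

/* closure -> search -> device -> cached_dev, as in the function above. */
static struct demo_cached_dev *demo_dc_from_closure(struct demo_closure *cl)
{
	struct demo_search *s = demo_container_of(cl, struct demo_search, cl);

	return demo_container_of(s->d, struct demo_cached_dev, disk);
}
```

The second step is only valid when s->d really points at the disk member of a cached_dev, which is why the flash-only case has to be excluded before such a conversion is made.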
|
|
|
|
|
|
|
|
/* Process reads */
|
|
|
|
|
2019-04-25 00:48:26 +08:00
|
|
|
static void cached_dev_read_error_done(struct closure *cl)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
struct search *s = container_of(cl, struct search, cl);
|
|
|
|
|
2013-09-11 10:02:45 +08:00
|
|
|
if (s->iop.replace_collision)
|
|
|
|
bch_mark_cache_miss_collision(s->iop.c, s->d);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2016-09-22 15:10:01 +08:00
|
|
|
if (s->iop.bio)
|
|
|
|
bio_free_pages(s->iop.bio);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
cached_dev_bio_complete(cl);
|
|
|
|
}
|
|
|
|
|
2013-09-11 08:06:17 +08:00
|
|
|
static void cached_dev_read_error(struct closure *cl)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
struct search *s = container_of(cl, struct search, cl);
|
2013-09-11 08:06:17 +08:00
|
|
|
struct bio *bio = &s->bio.bio;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
bcache: only permit to recovery read error when cache device is clean
When bcache does read I/Os, for example in writeback or writethrough mode,
if a read request on the cache device fails, bcache will try to recover
the request by reading from the cached device. If the data on the cached device is
not synced with the cache device, the requester will get stale data.
For critical storage systems such as databases, providing stale data from
recovery may result in application level data corruption, which is
unacceptable.
With this patch, for a failed read request in writeback or writethrough
mode, recovery of a recoverable read request only happens when the cache device
is clean. That is to say, all data on the cached device is up to date.
For other cache modes in bcache, read request will never hit
cached_dev_read_error(), they don't need this patch.
Please note, because cache mode can be switched arbitrarily in run time, a
writethrough mode might be switched from a writeback mode. Therefore
checking dc->has_dirty in writethrough mode still makes sense.
Changelog:
V4: Fix parens error pointed by Michael Lyle.
v3: By response from Kent Overstreet, he thinks recovering stale data is a
bug to fix, and option to permit it is unnecessary. So this version
the sysfs file is removed.
v2: rename sysfs entry from allow_stale_data_on_failure to
allow_stale_data_on_failure, and fix the confusing commit log.
v1: initial patch posted.
[small change to patch comment spelling by mlyle]
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reported-by: Arne Wolf <awolf@lenovo.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Nix <nix@esperi.org.uk>
Cc: Kai Krakow <hurikhan77@gmail.com>
Cc: Eric Wheeler <bcache@lists.ewheeler.net>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-31 05:46:31 +08:00
|
|
|
/*
|
bcache: recover data from backing when data is clean
When we send a read request and hit the clean data in cache device, there
is a situation called cache read race in bcache (see the comment at the tail
of cache_lookup_fn(); the following explanation is copied from there):
The bucket we're reading from might be reused while our bio is in flight,
and we could then end up reading the wrong data. We guard against this
by checking (in bch_cache_read_endio()) if the pointer is stale again;
if so, we treat it as an error (s->iop.error = -EINTR) and reread from
the backing device (but we don't pass that error up anywhere)
It should be noted that cache read race happened under normal
circumstances, not the circumstance when SSD failed, it was counted
and shown in /sys/fs/bcache/XXX/internal/cache_read_races.
Without this patch, when we use writeback mode, we will never reread from
the backing device when cache read race happened, until the whole cache
device is clean, because the condition
(s->recoverable && (dc && !atomic_read(&dc->has_dirty))) is false in
cached_dev_read_error(). In this situation, the s->iop.error(= -EINTR)
will be passed up and, in the end, the user will receive -EINTR when its bio ends;
this is not suitable, and is weird to the upper-layer application.
In this patch, we use s->read_dirty_data to judge whether the read
request hit dirty data in cache device, it is safe to reread data from
the backing device when the read request hit clean data. This can not
only handle cache read race, but also recover data when failed read
request from cache device.
[edited by mlyle to fix up whitespace, commit log title, comment
spelling]
Fixes: d59b23795933 ("bcache: only permit to recovery read error when cache device is clean")
Cc: <stable@vger.kernel.org> # 4.14
Signed-off-by: Hua Rui <huarui.dev@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-25 07:14:26 +08:00
|
|
|
* If read request hit dirty data (s->read_dirty_data is true),
|
|
|
|
* then recovering a failed read request from the cached device may
|
|
|
|
* return stale data. So read failure recovery is only
|
|
|
|
* permitted when read request hit clean data in cache device,
|
|
|
|
* or when cache read race happened.
|
2017-10-31 05:46:31 +08:00
|
|
|
*/
|
bcache: recover data from backing when data is clean
When we send a read request and hit the clean data in cache device, there
is a situation called cache read race in bcache(see the commit in the tail
of cache_look_up(), the following explaination just copy from there):
The bucket we're reading from might be reused while our bio is in flight,
and we could then end up reading the wrong data. We guard against this
by checking (in bch_cache_read_endio()) if the pointer is stale again;
if so, we treat it as an error (s->iop.error = -EINTR) and reread from
the backing device (but we don't pass that error up anywhere)
It should be noted that a cache read race happens under normal
circumstances, not only when the SSD has failed; it is counted
and shown in /sys/fs/bcache/XXX/internal/cache_read_races.
Without this patch, in writeback mode we never reread from
the backing device when a cache read race happens until the whole cache
device is clean, because the condition
(s->recoverable && (dc && !atomic_read(&dc->has_dirty))) is false in
cached_dev_read_error(). In this situation s->iop.error (= -EINTR)
is passed up, and the user finally receives -EINTR when its bio ends.
This is not appropriate, and looks weird to upper-layer applications.
In this patch, we use s->read_dirty_data to judge whether the read
request hit dirty data in the cache device; it is safe to reread data from
the backing device when the read request hit clean data. This not
only handles the cache read race, but also recovers data when a read
request from the cache device fails.
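The difference between the old and the new retry condition can be sketched in user-space C as follows. This is only an illustration: struct search_model and both helper functions are hypothetical stand-ins, not the kernel's struct search or cached_dev_read_error().

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the fields of struct search used here. */
struct search_model {
    bool recoverable;     /* the request may be retried from the backing device */
    bool read_dirty_data; /* this read hit dirty data in the cache */
};

/* Old condition: retry only when the whole cache device holds no dirty data. */
static bool may_retry_old(const struct search_model *s, bool cache_has_dirty)
{
    return s->recoverable && !cache_has_dirty;
}

/* New condition: retry whenever this particular read hit only clean data. */
static bool may_retry_new(const struct search_model *s)
{
    return s->recoverable && !s->read_dirty_data;
}
```

With a recoverable read that hit clean data while other dirty data exists in the cache, may_retry_old() returns false and may_retry_new() returns true, which is the case the patch is meant to cover.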
[edited by mlyle to fix up whitespace, commit log title, comment
spelling]
Fixes: d59b23795933 ("bcache: only permit to recovery read error when cache device is clean")
Cc: <stable@vger.kernel.org> # 4.14
Signed-off-by: Hua Rui <huarui.dev@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-25 07:14:26 +08:00
|
|
|
if (s->recoverable && !s->read_dirty_data) {
|
2013-04-27 06:39:55 +08:00
|
|
|
/* Retry from the backing device: */
|
|
|
|
trace_bcache_read_retry(s->orig_bio);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2017-06-03 15:38:06 +08:00
|
|
|
s->iop.status = 0;
|
2018-03-19 08:36:24 +08:00
|
|
|
do_bio_hook(s, s->orig_bio, backing_request_endio);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
/* XXX: invalidate cache */
|
|
|
|
|
2018-03-19 08:36:24 +08:00
|
|
|
/* I/O request sent to backing device */
|
bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags
When too many I/Os fail on the cache device, bch_cache_set_error() is called
in the error handling code path to retire the whole problematic cache set. If
new I/O requests continue to come in and take the refcount dc->count, the cache
set won't be retired immediately, which is a problem.
Furthermore, several kernel threads and self-armed kernel works
may still be running after bch_cache_set_error() is called. It can take
quite a while for them to stop, or they may not stop at all. They also
prevent the cache set from being retired.
The solution in this patch is to add a per-cache-set flag that disables I/O
requests on this cache and all attached backing devices. New incoming I/O
requests can then be rejected in *_make_request() before taking the refcount,
and kernel threads and self-armed kernel workers can stop very quickly when
the flag bit CACHE_SET_IO_DISABLE is set.
Because bcache also does internal I/O for writeback, garbage collection,
bucket allocation and journaling, this kind of I/O should be disabled after
bch_cache_set_error() is called. So closure_bio_submit() is modified to
check whether CACHE_SET_IO_DISABLE is set in cache_set->flags. If it is set,
closure_bio_submit() sets bio->bi_status to BLK_STS_IOERR and
returns; generic_make_request() is not called.
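The check described above can be modeled in a few lines of user-space C. This is a simplified sketch, not the kernel implementation: the structs, the status constants and model_bio_submit() are hypothetical; only the bit index 3 for CACHE_SET_IO_DISABLE is taken from the changelog of this patch.

```c
#include <stdbool.h>

#define CACHE_SET_IO_DISABLE 3  /* bit index, per the v3 changelog entry */
#define MODEL_STS_OK         0  /* hypothetical status codes for this sketch */
#define MODEL_STS_IOERR      10

struct cache_set_model { unsigned long flags; };
struct bio_model { int status; bool submitted; };

/* Fail the bio early when I/O is disabled, otherwise pass it on. */
static void model_bio_submit(struct cache_set_model *c, struct bio_model *bio)
{
    if (c->flags & (1UL << CACHE_SET_IO_DISABLE)) {
        bio->status = MODEL_STS_IOERR; /* stands in for BLK_STS_IOERR */
        return;                        /* the lower layer is never reached */
    }
    bio->submitted = true;             /* stands in for generic_make_request() */
}
```

Because the test sits in the single submission helper, every internal I/O path that goes through it is covered without touching the callers.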
A sysfs interface is also added to set or clear the CACHE_SET_IO_DISABLE bit
in cache_set->flags, to disable or enable cache set I/O for debugging. It
is helpful for triggering more corner-case issues on a failed cache device.
Changelog
v4, add wait_for_kthread_stop(), and call it before exits writeback and gc
kernel threads.
v3, change CACHE_SET_IO_DISABLE from 4 to 3, since it is bit index.
remove "bcache: " prefix when printing out kernel message.
v2, more changes by previous review,
- Use CACHE_SET_IO_DISABLE of cache_set->flags, suggested by Junhui.
- Check CACHE_SET_IO_DISABLE in bch_btree_gc() to stop a while-loop; this
was reported by and inspired by the original patch of Pavel Vazharov.
v1, initial version.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Cc: Michael Lyle <mlyle@lyle.org>
Cc: Pavel Vazharov <freakpv@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-19 08:36:17 +08:00
|
|
|
closure_bio_submit(s->iop.c, bio, cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2019-04-25 00:48:26 +08:00
|
|
|
continue_at(cl, cached_dev_read_error_done, NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void cached_dev_cache_miss_done(struct closure *cl)
|
|
|
|
{
|
|
|
|
struct search *s = container_of(cl, struct search, cl);
|
|
|
|
struct bcache_device *d = s->d;
|
|
|
|
|
|
|
|
if (s->iop.replace_collision)
|
|
|
|
bch_mark_cache_miss_collision(s->iop.c, s->d);
|
|
|
|
|
|
|
|
if (s->iop.bio)
|
|
|
|
bio_free_pages(s->iop.bio);
|
|
|
|
|
|
|
|
cached_dev_bio_complete(cl);
|
|
|
|
closure_put(&d->cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2013-09-11 08:06:17 +08:00
|
|
|
static void cached_dev_read_done(struct closure *cl)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
struct search *s = container_of(cl, struct search, cl);
|
|
|
|
struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
|
|
|
|
|
|
|
|
/*
|
2013-09-11 08:06:17 +08:00
|
|
|
* We had a cache miss; cache_bio now contains data ready to be inserted
|
|
|
|
* into the cache.
|
2013-03-24 07:11:31 +08:00
|
|
|
*
|
|
|
|
* First, we copy the data we just read from cache_bio's bounce buffers
|
|
|
|
* to the buffers the original bio pointed to:
|
|
|
|
*/
|
|
|
|
|
2013-09-11 10:02:45 +08:00
|
|
|
if (s->iop.bio) {
|
2022-01-24 17:11:07 +08:00
|
|
|
bio_reset(s->iop.bio, s->cache_miss->bi_bdev, REQ_OP_READ);
|
2018-08-11 13:19:47 +08:00
|
|
|
s->iop.bio->bi_iter.bi_sector =
|
|
|
|
s->cache_miss->bi_iter.bi_sector;
|
2013-10-12 06:44:27 +08:00
|
|
|
s->iop.bio->bi_iter.bi_size = s->insert_bio_sectors << 9;
|
2022-01-24 17:11:07 +08:00
|
|
|
bio_clone_blkg_association(s->iop.bio, s->cache_miss);
|
2013-09-11 10:02:45 +08:00
|
|
|
bch_bio_map(s->iop.bio, NULL);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:02:45 +08:00
|
|
|
bio_copy_data(s->cache_miss, s->iop.bio);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
bio_put(s->cache_miss);
|
|
|
|
s->cache_miss = NULL;
|
|
|
|
}
|
|
|
|
|
2017-10-14 07:35:34 +08:00
|
|
|
if (verify(dc) && s->recoverable && !s->read_dirty_data)
|
2013-09-11 10:02:45 +08:00
|
|
|
bch_data_verify(dc, s->orig_bio);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2019-04-25 00:48:26 +08:00
|
|
|
closure_get(&dc->disk.cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
bio_complete(s);
|
|
|
|
|
2013-09-11 10:02:45 +08:00
|
|
|
if (s->iop.bio &&
|
|
|
|
!test_bit(CACHE_SET_STOPPING, &s->iop.c->flags)) {
|
|
|
|
BUG_ON(!s->iop.replace);
|
|
|
|
closure_call(&s->iop.cl, bch_data_insert, NULL, cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2013-09-11 08:06:17 +08:00
|
|
|
continue_at(cl, cached_dev_cache_miss_done, NULL);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2013-09-11 08:06:17 +08:00
|
|
|
static void cached_dev_read_done_bh(struct closure *cl)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
struct search *s = container_of(cl, struct search, cl);
|
|
|
|
struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
|
|
|
|
|
2013-09-11 10:02:45 +08:00
|
|
|
bch_mark_cache_accounting(s->iop.c, s->d,
|
2017-10-31 05:46:34 +08:00
|
|
|
!s->cache_missed, s->iop.bypass);
|
2018-10-08 20:41:08 +08:00
|
|
|
trace_bcache_read(s->orig_bio, !s->cache_missed, s->iop.bypass);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2017-06-03 15:38:06 +08:00
|
|
|
if (s->iop.status)
|
2013-09-11 08:06:17 +08:00
|
|
|
continue_at_nobarrier(cl, cached_dev_read_error, bcache_wq);
|
2017-10-14 07:35:34 +08:00
|
|
|
else if (s->iop.bio || verify(dc))
|
2013-09-11 08:06:17 +08:00
|
|
|
continue_at_nobarrier(cl, cached_dev_read_done, bcache_wq);
|
2013-03-24 07:11:31 +08:00
|
|
|
else
|
2013-09-11 08:06:17 +08:00
|
|
|
continue_at_nobarrier(cl, cached_dev_bio_complete, NULL);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static int cached_dev_cache_miss(struct btree *b, struct search *s,
|
2018-08-11 13:19:44 +08:00
|
|
|
struct bio *bio, unsigned int sectors)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2013-07-25 08:41:08 +08:00
|
|
|
int ret = MAP_CONTINUE;
|
2013-03-24 07:11:31 +08:00
|
|
|
struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
|
2013-09-11 08:06:17 +08:00
|
|
|
struct bio *miss, *cache_bio;
|
bcache: avoid oversized read request in cache missing code path
In the cache missing code path of a cached device, if a proper location
from the internal B+ tree is matched for a cache miss range, function
cached_dev_cache_miss() will be called in cache_lookup_fn() in the
following code block,
[code block 1]
    unsigned int sectors = KEY_INODE(k) == s->iop.inode
        ? min_t(uint64_t, INT_MAX,
                KEY_START(k) - bio->bi_iter.bi_sector)
        : INT_MAX;
    int ret = s->d->cache_miss(b, s, bio, sectors);
Here s->d->cache_miss() is the callback function pointer initialized as
cached_dev_cache_miss(); the last parameter 'sectors' is an important
hint for calculating the size of the read request sent to the backing
device for the missing cache data.
The current calculation in the above code block may generate an oversized
value of 'sectors', which may consequently trigger 2 different potential
kernel panics by BUG() or BUG_ON() as listed below,
1) BUG_ON() inside bch_btree_insert_key(),
[code block 2]
    BUG_ON(b->ops->is_extents && !KEY_SIZE(k));
2) BUG() inside biovec_slab(),
[code block 3]
    default:
        BUG();
        return NULL;
All the above panics originate from cached_dev_cache_miss() because of the
oversized parameter 'sectors'.
Inside cached_dev_cache_miss(), parameter 'sectors' is used to calculate
the size of data read from the backing device for the cache miss. This
size is stored in s->insert_bio_sectors by the following line of code,
[code block 4]
    s->insert_bio_sectors = min(sectors, bio_sectors(bio) + reada);
Then the actual key to be inserted into the internal B+ tree is generated and
stored in s->iop.replace_key by the following lines of code,
[code block 5]
    s->iop.replace_key = KEY(s->iop.inode,
                             bio->bi_iter.bi_sector + s->insert_bio_sectors,
                             s->insert_bio_sectors);
The oversized parameter 'sectors' may trigger panic 1) by BUG_ON() from
the above code block.
The bio sent to the backing device for the missing data is allocated
with a hint from s->insert_bio_sectors by the following lines of code,
[code block 6]
    cache_bio = bio_alloc_bioset(GFP_NOWAIT,
                    DIV_ROUND_UP(s->insert_bio_sectors, PAGE_SECTORS),
                    &dc->disk.bio_split);
The oversized parameter 'sectors' may trigger panic 2) by BUG() from the
above code block.
Now let me explain how the panics happen with the oversized 'sectors'.
In code block 5, replace_key is generated by macro KEY(). From the
definition of macro KEY(),
[code block 7]
    #define KEY(inode, offset, size) \
    ((struct bkey) { \
        .high = (1ULL << 63) | ((__u64) (size) << 20) | (inode), \
        .low = (offset) \
    })
Here 'size' is a 16-bit field embedded in the 64-bit member 'high' of struct
bkey. But in code block 1, "KEY_START(k) - bio->bi_iter.bi_sector" is
very likely to be larger than (1<<16) - 1, which makes the bkey size
calculation in code block 5 overflow. In one bug report the value
of parameter 'sectors' is 131072 (= 1 << 17); the overflowed 'sectors'
results in an overflowed s->insert_bio_sectors in code block 4, which then
makes the size field of s->iop.replace_key 0 in code block 5. Then the
0-sized s->iop.replace_key is inserted into the internal B+ tree as the cache
missing check key (a special key to detect and avoid a race between
a normal write request and a cache missing read request) as,
[code block 8]
    ret = bch_btree_insert_check_key(b, &s->op, &s->iop.replace_key);
Then the 0-sized s->iop.replace_key as the 3rd parameter triggers the bkey
size check BUG_ON() in code block 2, and causes kernel panic 1).
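The overflow can be reproduced with a small user-space model of the packing done by KEY(). This is a sketch under assumptions: model_key_high() and model_key_size() are hypothetical helpers, and the size field is assumed to occupy bits 20 to 35 of 'high', as implied by the shift of 20 and the 16-bit width stated above.

```c
#include <stdint.h>

#define MODEL_KEY_SIZE_BITS 16

/* Pack like KEY(): size is shifted to bit 20, inode sits in the low bits. */
static uint64_t model_key_high(uint64_t inode, uint64_t size)
{
    return (1ULL << 63) | (size << 20) | inode;
}

/* Read back only the 16-bit size field, as a KEY_SIZE()-style accessor would. */
static unsigned int model_key_size(uint64_t high)
{
    return (unsigned int)((high >> 20) & ((1ULL << MODEL_KEY_SIZE_BITS) - 1));
}
```

For a size of 131072 (1 << 17) the set bit lies above the 16-bit field, so model_key_size() reads back 0, while 65535 still round-trips intact.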
The other kernel panic comes from code block 6, and is caused by a bvecs
number computed from the oversized value s->insert_bio_sectors from code
block 4,
min(sectors, bio_sectors(bio) + reada)
There are two possibilities for an oversized result,
- bio_sectors(bio) is valid, but bio_sectors(bio) + reada is oversized.
- sectors < bio_sectors(bio) + reada, but sectors is oversized.
From a bug report the result of "DIV_ROUND_UP(s->insert_bio_sectors,
PAGE_SECTORS)" from code block 6 can be 344, 282, 946, 342 and many
other values which are larger than BIO_MAX_VECS (a.k.a. 256). When calling
bio_alloc_bioset() with such a larger-than-256 value as the 2nd parameter,
this value will eventually be sent to biovec_slab() as parameter
'nr_vecs' in the following code path,
bio_alloc_bioset() ==> bvec_alloc() ==> biovec_slab()
Because parameter 'nr_vecs' is a larger-than-256 value, the panic by BUG()
in code block 3 is triggered inside biovec_slab().
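The bvecs arithmetic can be checked with the following user-space sketch. It assumes 4 KiB pages and 512-byte sectors, so that PAGE_SECTORS is 8; the MODEL_ names and both helpers are hypothetical and only mirror the DIV_ROUND_UP() expression from code block 6.

```c
#define MODEL_PAGE_SECTORS 8u   /* assumption: 4 KiB pages, 512-byte sectors */
#define MODEL_BIO_MAX_VECS 256u /* limit quoted in the text above */

/* Number of page-sized bvecs needed for a request of 'sectors' sectors. */
static unsigned int model_nr_vecs(unsigned int sectors)
{
    return (sectors + MODEL_PAGE_SECTORS - 1) / MODEL_PAGE_SECTORS;
}

/* Non-zero when the request fits into a single bio's bvec table. */
static int model_nr_vecs_ok(unsigned int sectors)
{
    return model_nr_vecs(sectors) <= MODEL_BIO_MAX_VECS;
}
```

Under these assumptions a request of 2752 sectors needs 344 bvecs, one of the values from the bug report, and any request above 2048 sectors exceeds the limit of 256.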
From the above analysis, we know that the 4th parameter 'sectors' sent
into cached_dev_cache_miss() may cause overflow in code blocks 5 and 6,
and finally cause kernel panics in code blocks 2 and 3. And if the result of
bio_sectors(bio) + reada exceeds the valid bvecs number, it may also trigger
a kernel panic in code block 3 from code block 6.
Now that the almost-useless readahead size for cache missing requests sent to
the backing device is removed, this patch can fix the oversized issue with
a simpler method.
- add a local variable size_limit, set to the minimum of
the max bkey size and the max bio bvecs number.
- set s->insert_bio_sectors to the minimum of size_limit,
sectors, and the sectors size of bio.
- replace sectors with s->insert_bio_sectors when calling bio_next_split.
With the above method using size_limit, s->insert_bio_sectors will never
result in an oversized replace_key size or bio bvecs number. And the split bio
'miss' from bio_next_split() will always match the size of 'cache_bio',
that is, the current maximum bio size we can send to the backing device for
fetching the cache missing data.
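The clamping steps listed above can be sketched in user-space C as follows. It is an illustration under the same assumptions as before (PAGE_SECTORS of 8, BIO_MAX_VECS of 256, a 16-bit key size field); model_min() and model_insert_bio_sectors() are hypothetical names standing in for the min_t()/min3() calls of the patch.

```c
#define MODEL_PAGE_SECTORS  8u   /* assumption: 4 KiB pages, 512-byte sectors */
#define MODEL_BIO_MAX_VECS  256u
#define MODEL_KEY_SIZE_BITS 16

static unsigned int model_min(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/* Clamp the request to what both the bkey size field and one bio can hold. */
static unsigned int model_insert_bio_sectors(unsigned int sectors,
                                             unsigned int bio_sectors)
{
    unsigned int size_limit = model_min(MODEL_BIO_MAX_VECS * MODEL_PAGE_SECTORS,
                                        (1u << MODEL_KEY_SIZE_BITS) - 1);

    return model_min(size_limit, model_min(sectors, bio_sectors));
}
```

With these values size_limit is 2048 sectors, so the 131072-sector hint from the bug report is cut down to 2048 for a 4096-sector bio, while small requests pass through unchanged.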
The current problematic code can be partially found since Linux v3.13-rc1,
therefore all maintained stable kernels should try to apply this fix.
Reported-by: Alexander Ullrich <ealex1979@gmail.com>
Reported-by: Diego Ercolani <diego.ercolani@gmail.com>
Reported-by: Jan Szubiak <jan.szubiak@linuxpolska.pl>
Reported-by: Marco Rebhan <me@dblsaiko.net>
Reported-by: Matthias Ferdinand <bcache@mfedv.net>
Reported-by: Victor Westerhuis <victor@westerhu.is>
Reported-by: Vojtech Pavlik <vojtech@suse.cz>
Reported-and-tested-by: Rolf Fokkens <rolf@rolffokkens.nl>
Reported-and-tested-by: Thorsten Knabe <linux@thorsten-knabe.de>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Nix <nix@esperi.org.uk>
Cc: Takashi Iwai <tiwai@suse.com>
Link: https://lore.kernel.org/r/20210607125052.21277-3-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-07 20:50:52 +08:00
|
|
|
unsigned int size_limit;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2017-10-31 05:46:34 +08:00
|
|
|
s->cache_missed = 1;
|
|
|
|
|
2013-09-11 10:02:45 +08:00
|
|
|
if (s->cache_miss || s->iop.bypass) {
|
2018-05-21 06:25:51 +08:00
|
|
|
miss = bio_next_split(bio, sectors, GFP_NOIO, &s->d->bio_split);
|
2013-07-25 08:41:08 +08:00
|
|
|
ret = miss == bio ? MAP_DONE : MAP_CONTINUE;
|
2013-09-11 09:39:16 +08:00
|
|
|
goto out_submit;
|
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2021-06-07 20:50:52 +08:00
|
|
|
/* Limitation for valid replace key size and cache_bio bvecs number */
|
|
|
|
size_limit = min_t(unsigned int, BIO_MAX_VECS * PAGE_SECTORS,
|
|
|
|
(1 << KEY_SIZE_BITS) - 1);
|
|
|
|
s->insert_bio_sectors = min3(size_limit, sectors, bio_sectors(bio));
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:02:45 +08:00
|
|
|
s->iop.replace_key = KEY(s->iop.inode,
|
2013-10-12 06:44:27 +08:00
|
|
|
bio->bi_iter.bi_sector + s->insert_bio_sectors,
|
2013-09-11 10:02:45 +08:00
|
|
|
s->insert_bio_sectors);
|
2013-09-11 09:39:16 +08:00
|
|
|
|
2013-09-11 10:02:45 +08:00
|
|
|
ret = bch_btree_insert_check_key(b, &s->op, &s->iop.replace_key);
|
2013-09-11 09:39:16 +08:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2013-09-11 10:02:45 +08:00
|
|
|
s->iop.replace = true;
|
2013-09-11 09:52:54 +08:00
|
|
|
|
2021-06-07 20:50:52 +08:00
|
|
|
miss = bio_next_split(bio, s->insert_bio_sectors, GFP_NOIO,
|
|
|
|
&s->d->bio_split);
|
2013-07-25 08:41:08 +08:00
|
|
|
|
|
|
|
/* btree_search_recurse()'s btree iterator is no good anymore */
|
|
|
|
ret = miss == bio ? MAP_DONE : -EINTR;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2022-01-24 17:11:03 +08:00
|
|
|
cache_bio = bio_alloc_bioset(miss->bi_bdev,
|
2013-09-11 10:02:45 +08:00
|
|
|
DIV_ROUND_UP(s->insert_bio_sectors, PAGE_SECTORS),
|
2022-01-24 17:11:03 +08:00
|
|
|
0, GFP_NOWAIT, &dc->disk.bio_split);
|
2013-09-11 08:06:17 +08:00
|
|
|
if (!cache_bio)
|
2013-03-24 07:11:31 +08:00
|
|
|
goto out_submit;
|
|
|
|
|
2013-10-12 06:44:27 +08:00
|
|
|
cache_bio->bi_iter.bi_sector = miss->bi_iter.bi_sector;
|
|
|
|
cache_bio->bi_iter.bi_size = s->insert_bio_sectors << 9;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2018-03-19 08:36:24 +08:00
|
|
|
cache_bio->bi_end_io = backing_request_endio;
|
2013-09-11 08:06:17 +08:00
|
|
|
cache_bio->bi_private = &s->cl;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 08:06:17 +08:00
|
|
|
bch_bio_map(cache_bio, NULL);
|
2017-12-18 20:22:10 +08:00
|
|
|
if (bch_bio_alloc_pages(cache_bio, __GFP_NOWARN|GFP_NOIO))
|
2013-03-24 07:11:31 +08:00
|
|
|
goto out_put;
|
|
|
|
|
2013-09-11 08:06:17 +08:00
|
|
|
s->cache_miss = miss;
|
2013-09-11 10:02:45 +08:00
|
|
|
s->iop.bio = cache_bio;
|
2013-09-11 08:06:17 +08:00
|
|
|
bio_get(cache_bio);
|
2018-03-19 08:36:24 +08:00
|
|
|
/* I/O request sent to backing device */
|
bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags
When too many I/Os fail on the cache device, bch_cache_set_error() is called
in the error handling code path to retire the whole problematic cache set. If
new I/O requests continue to arrive and take the refcount dc->count, the cache
set won't be retired immediately, which is a problem.
Furthermore, several kernel threads and self-armed kernel works may still be
running after bch_cache_set_error() is called. It takes quite a while for
them to stop, or they won't stop at all. They also
prevent the cache set from being retired.
The solution in this patch is to add a per-cache-set flag to disable I/O
requests on this cache and all attached backing devices. New incoming I/O
requests can then be rejected in *_make_request() before taking the refcount, and
kernel threads and self-armed kernel workers can stop very quickly when the flag bit
CACHE_SET_IO_DISABLE is set.
Because bcache also does internal I/O for writeback, garbage collection,
bucket allocation and journaling, this kind of I/O should be disabled after
bch_cache_set_error() is called. So closure_bio_submit() is modified to
check whether CACHE_SET_IO_DISABLE is set in cache_set->flags. If set,
closure_bio_submit() will set bio->bi_status to BLK_STS_IOERR and
return; generic_make_request() won't be called.
A sysfs interface is also added to set or clear the CACHE_SET_IO_DISABLE bit
in cache_set->flags, to disable or enable cache set I/O for debugging. It
is helpful for triggering more corner-case issues for a failed cache device.
Changelog
v4, add wait_for_kthread_stop(), and call it before exiting the writeback and gc
kernel threads.
v3, change CACHE_SET_IO_DISABLE from 4 to 3, since it is a bit index.
remove "bcache: " prefix when printing out kernel message.
v2, more changes from the previous review,
- Use CACHE_SET_IO_DISABLE of cache_set->flags, suggested by Junhui.
- Check CACHE_SET_IO_DISABLE in bch_btree_gc() to stop a while-loop; this
was reported by, and inspired by the original patch of, Pavel Vazharov.
v1, initial version.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Cc: Michael Lyle <mlyle@lyle.org>
Cc: Pavel Vazharov <freakpv@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
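The gating described in the commit message above can be sketched in user-space C. This is only an illustration under stated assumptions, not the bcache implementation: `struct toy_cache_set`, `struct toy_bio`, `toy_submit()` and the `TOY_*` constants are hypothetical stand-ins for `struct cache_set`, `struct bio`, `closure_bio_submit()` and the kernel's flag and status values. The bit index 3 follows the v3 changelog note.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's flag bit and status codes. */
enum { TOY_IO_DISABLE_BIT = 3 };
enum toy_status { TOY_STS_OK = 0, TOY_STS_IOERR = 1 };

struct toy_cache_set { atomic_ulong flags; };
struct toy_bio { enum toy_status status; bool submitted; };

/* Return true when bit nr is set in the flag word. */
static bool toy_test_bit(int nr, atomic_ulong *flags)
{
	return (atomic_load(flags) >> nr) & 1UL;
}

/* Atomically set bit nr in the flag word. */
static void toy_set_bit(int nr, atomic_ulong *flags)
{
	atomic_fetch_or(flags, 1UL << nr);
}

/* Fail the request up front when I/O is disabled; otherwise pass it down. */
static void toy_submit(struct toy_cache_set *c, struct toy_bio *bio)
{
	if (toy_test_bit(TOY_IO_DISABLE_BIT, &c->flags)) {
		bio->status = TOY_STS_IOERR;
		return;	/* the lower layer is never reached */
	}
	bio->submitted = true;
}
```

Checking the flag before any reference is taken is what lets a failing cache set drain: rejected requests never hold it open.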
2018-03-19 08:36:17 +08:00
|
|
|
closure_bio_submit(s->iop.c, cache_bio, &s->cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
out_put:
|
2013-09-11 08:06:17 +08:00
|
|
|
bio_put(cache_bio);
|
2013-03-24 07:11:31 +08:00
|
|
|
out_submit:
|
2018-03-19 08:36:24 +08:00
|
|
|
miss->bi_end_io = backing_request_endio;
|
2013-09-11 09:39:16 +08:00
|
|
|
miss->bi_private = &s->cl;
|
2018-03-19 08:36:24 +08:00
|
|
|
/* I/O request sent to backing device */
|
2018-03-19 08:36:17 +08:00
|
|
|
closure_bio_submit(s->iop.c, miss, &s->cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-09-11 08:06:17 +08:00
|
|
|
static void cached_dev_read(struct cached_dev *dc, struct search *s)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
struct closure *cl = &s->cl;
|
|
|
|
|
2013-09-11 10:02:45 +08:00
|
|
|
closure_call(&s->iop.cl, cache_lookup, NULL, cl);
|
2013-09-11 08:06:17 +08:00
|
|
|
continue_at(cl, cached_dev_read_done_bh, NULL);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Process writes */
|
|
|
|
|
|
|
|
static void cached_dev_write_complete(struct closure *cl)
|
|
|
|
{
|
|
|
|
struct search *s = container_of(cl, struct search, cl);
|
|
|
|
struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
|
|
|
|
|
|
|
|
up_read_non_owner(&dc->writeback_lock);
|
|
|
|
cached_dev_bio_complete(cl);
|
|
|
|
}
|
|
|
|
|
2013-09-11 08:06:17 +08:00
|
|
|
static void cached_dev_write(struct cached_dev *dc, struct search *s)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
struct closure *cl = &s->cl;
|
|
|
|
struct bio *bio = &s->bio.bio;
|
2013-10-12 06:44:27 +08:00
|
|
|
struct bkey start = KEY(dc->disk.id, bio->bi_iter.bi_sector, 0);
|
2013-07-25 08:24:52 +08:00
|
|
|
struct bkey end = KEY(dc->disk.id, bio_end_sector(bio), 0);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:02:45 +08:00
|
|
|
bch_keybuf_check_overlapping(&s->iop.c->moving_gc_keys, &start, &end);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
down_read_non_owner(&dc->writeback_lock);
|
|
|
|
if (bch_keybuf_check_overlapping(&dc->writeback_keys, &start, &end)) {
|
2013-07-25 08:24:52 +08:00
|
|
|
/*
|
|
|
|
* We overlap with some dirty data undergoing background
|
|
|
|
* writeback, force this write to writeback
|
|
|
|
*/
|
2013-09-11 10:02:45 +08:00
|
|
|
s->iop.bypass = false;
|
|
|
|
s->iop.writeback = true;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2013-07-25 08:24:52 +08:00
|
|
|
/*
|
|
|
|
* Discards aren't _required_ to do anything, so skipping if
|
|
|
|
* check_overlapping returned true is ok
|
|
|
|
*
|
|
|
|
* But check_overlapping drops dirty keys for which io hasn't started,
|
|
|
|
* so we still want to call it.
|
|
|
|
*/
|
2016-06-06 03:32:05 +08:00
|
|
|
if (bio_op(bio) == REQ_OP_DISCARD)
|
2013-09-11 10:02:45 +08:00
|
|
|
s->iop.bypass = true;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-06-05 21:24:39 +08:00
|
|
|
if (should_writeback(dc, s->orig_bio,
|
2017-10-14 07:35:34 +08:00
|
|
|
cache_mode(dc),
|
2013-09-11 10:02:45 +08:00
|
|
|
s->iop.bypass)) {
|
|
|
|
s->iop.bypass = false;
|
|
|
|
s->iop.writeback = true;
|
2013-06-05 21:24:39 +08:00
|
|
|
}
|
|
|
|
|
2013-09-11 10:02:45 +08:00
|
|
|
if (s->iop.bypass) {
|
|
|
|
s->iop.bio = s->orig_bio;
|
|
|
|
bio_get(s->iop.bio);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2018-03-19 08:36:24 +08:00
|
|
|
if (bio_op(bio) == REQ_OP_DISCARD &&
|
2022-04-15 12:52:55 +08:00
|
|
|
!bdev_max_discard_sectors(dc->bdev))
|
2018-03-19 08:36:24 +08:00
|
|
|
goto insert_data;
|
|
|
|
|
|
|
|
/* I/O request sent to backing device */
|
|
|
|
bio->bi_end_io = backing_request_endio;
|
|
|
|
closure_bio_submit(s->iop.c, bio, cl);
|
|
|
|
|
2013-09-11 10:02:45 +08:00
|
|
|
} else if (s->iop.writeback) {
|
2013-06-05 21:21:07 +08:00
|
|
|
bch_writeback_add(dc);
|
2013-09-11 10:02:45 +08:00
|
|
|
s->iop.bio = bio;
|
2013-06-27 08:25:38 +08:00
|
|
|
|
2016-08-06 05:35:16 +08:00
|
|
|
if (bio->bi_opf & REQ_PREFLUSH) {
|
2018-03-19 08:36:24 +08:00
|
|
|
/*
|
|
|
|
* Also need to send a flush to the backing
|
|
|
|
* device.
|
|
|
|
*/
|
|
|
|
struct bio *flush;
|
|
|
|
|
2022-01-24 17:11:03 +08:00
|
|
|
flush = bio_alloc_bioset(bio->bi_bdev, 0,
|
|
|
|
REQ_OP_WRITE | REQ_PREFLUSH,
|
|
|
|
GFP_NOIO, &dc->disk.bio_split);
|
2018-03-19 08:36:24 +08:00
|
|
|
if (!flush) {
|
|
|
|
s->iop.status = BLK_STS_RESOURCE;
|
|
|
|
goto insert_data;
|
|
|
|
}
|
|
|
|
flush->bi_end_io = backing_request_endio;
|
2013-09-24 14:17:36 +08:00
|
|
|
flush->bi_private = cl;
|
2018-03-19 08:36:24 +08:00
|
|
|
/* I/O request sent to backing device */
|
2018-03-19 08:36:17 +08:00
|
|
|
closure_bio_submit(s->iop.c, flush, cl);
|
2013-06-27 08:25:38 +08:00
|
|
|
}
|
2013-07-25 08:24:52 +08:00
|
|
|
} else {
|
2022-02-03 00:01:09 +08:00
|
|
|
s->iop.bio = bio_alloc_clone(bio->bi_bdev, bio, GFP_NOIO,
|
|
|
|
&dc->disk.bio_split);
|
2018-03-19 08:36:24 +08:00
|
|
|
/* I/O request sent to backing device */
|
|
|
|
bio->bi_end_io = backing_request_endio;
|
2018-03-19 08:36:17 +08:00
|
|
|
closure_bio_submit(s->iop.c, bio, cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
2013-07-25 08:24:52 +08:00
|
|
|
|
2018-03-19 08:36:24 +08:00
|
|
|
insert_data:
|
2013-09-11 10:02:45 +08:00
|
|
|
closure_call(&s->iop.cl, bch_data_insert, NULL, cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
continue_at(cl, cached_dev_write_complete, NULL);
|
|
|
|
}
|
|
|
|
|
2013-10-25 08:07:04 +08:00
|
|
|
static void cached_dev_nodata(struct closure *cl)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2013-10-25 08:07:04 +08:00
|
|
|
struct search *s = container_of(cl, struct search, cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
struct bio *bio = &s->bio.bio;
|
|
|
|
|
2013-09-11 10:02:45 +08:00
|
|
|
if (s->iop.flush_journal)
|
|
|
|
bch_journal_meta(s->iop.c, cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 08:24:52 +08:00
|
|
|
/* If it's a flush, we send the flush to the backing device too */
|
2018-03-19 08:36:24 +08:00
|
|
|
bio->bi_end_io = backing_request_endio;
|
2018-03-19 08:36:17 +08:00
|
|
|
closure_bio_submit(s->iop.c, bio, cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
continue_at(cl, cached_dev_bio_complete, NULL);
|
|
|
|
}
|
|
|
|
|
2018-03-19 08:36:19 +08:00
|
|
|
struct detached_dev_io_private {
|
|
|
|
struct bcache_device *d;
|
|
|
|
unsigned long start_time;
|
|
|
|
bio_end_io_t *bi_end_io;
|
|
|
|
void *bi_private;
|
2021-01-24 18:02:37 +08:00
|
|
|
struct block_device *orig_bdev;
|
2018-03-19 08:36:19 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
static void detached_dev_end_io(struct bio *bio)
|
|
|
|
{
|
|
|
|
struct detached_dev_io_private *ddip;
|
|
|
|
|
|
|
|
ddip = bio->bi_private;
|
|
|
|
bio->bi_end_io = ddip->bi_end_io;
|
|
|
|
bio->bi_private = ddip->bi_private;
|
|
|
|
|
2020-07-28 21:59:20 +08:00
|
|
|
/* Count on the bcache device */
|
2021-01-24 18:02:37 +08:00
|
|
|
bio_end_io_acct_remapped(bio, ddip->start_time, ddip->orig_bdev);
|
2018-03-19 08:36:19 +08:00
|
|
|
|
bcache: add io_disable to struct cached_dev
If a bcache device is configured in writeback mode, the current code does not
handle write I/O errors on backing devices properly.
In writeback mode, a write request is written to the cache device and
later flushed to the backing device. If I/O fails when writing from the
cache device to the backing device, bcache code just ignores the error and
upper layer code is NOT notified that the backing device is broken.
This patch tries to handle backing device failure the same way cache device
failure is handled:
- Add an error counter 'io_errors' and an error limit 'error_limit' to struct
cached_dev. Also add io_disable to struct cached_dev to disable I/Os
on the problematic backing device.
- When an I/O error happens on the backing device, increase the io_errors counter.
If io_errors reaches error_limit, set cached_dev->io_disable to true, and
stop the bcache device.
The result is that if the backing device is broken or disconnected, and I/O errors
reach its error limit, the backing device will be disabled and the associated
bcache device will be removed from the system.
Changelog:
v2: remove "bcache: " prefix in pr_error(), and use correct name string to
print out bcache device gendisk name.
v1: indeed this is newly added in the v2 patch set.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Cc: Michael Lyle <mlyle@lyle.org>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
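The error-limit behaviour described in the commit message above can be sketched as follows. This is a user-space illustration only: `struct toy_backing_dev` and `toy_count_io_error()` are hypothetical names, and stopping the device is reduced here to setting a flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the error-tracking fields of struct cached_dev. */
struct toy_backing_dev {
	atomic_uint io_errors;		/* failed I/Os seen so far */
	unsigned int error_limit;	/* threshold that disables the device */
	atomic_bool io_disable;		/* set once the limit is reached */
};

/* Record one failed I/O; returns true once the device is disabled. */
static bool toy_count_io_error(struct toy_backing_dev *dc)
{
	unsigned int errors = atomic_fetch_add(&dc->io_errors, 1) + 1;

	if (errors >= dc->error_limit)
		atomic_store(&dc->io_disable, true);
	return atomic_load(&dc->io_disable);
}
```

With `error_limit` set to 3, the first two calls leave the device enabled and the third disables it.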
2018-03-19 08:36:25 +08:00
|
|
|
if (bio->bi_status) {
|
|
|
|
struct cached_dev *dc = container_of(ddip->d,
|
|
|
|
struct cached_dev, disk);
|
|
|
|
/* should count I/O error for backing device here */
|
|
|
|
bch_count_backing_io_errors(dc, bio);
|
|
|
|
}
|
2018-03-19 08:36:19 +08:00
|
|
|
|
2018-03-19 08:36:25 +08:00
|
|
|
kfree(ddip);
|
2018-03-19 08:36:19 +08:00
|
|
|
bio->bi_end_io(bio);
|
|
|
|
}
|
|
|
|
|
2021-01-24 18:02:37 +08:00
|
|
|
static void detached_dev_do_request(struct bcache_device *d, struct bio *bio,
|
|
|
|
struct block_device *orig_bdev, unsigned long start_time)
|
2018-03-19 08:36:19 +08:00
|
|
|
{
|
|
|
|
struct detached_dev_io_private *ddip;
|
|
|
|
struct cached_dev *dc = container_of(d, struct cached_dev, disk);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* no need to call closure_get(&dc->disk.cl),
|
|
|
|
* because upper layer had already opened bcache device,
|
|
|
|
* which would call closure_get(&dc->disk.cl)
|
|
|
|
*/
|
|
|
|
ddip = kzalloc(sizeof(struct detached_dev_io_private), GFP_NOIO);
|
2022-05-27 23:28:18 +08:00
|
|
|
if (!ddip) {
|
|
|
|
bio->bi_status = BLK_STS_RESOURCE;
|
|
|
|
bio->bi_end_io(bio);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2018-03-19 08:36:19 +08:00
|
|
|
ddip->d = d;
|
2020-07-28 21:59:20 +08:00
|
|
|
/* Count on the bcache device */
|
2021-01-24 18:02:37 +08:00
|
|
|
ddip->orig_bdev = orig_bdev;
|
|
|
|
ddip->start_time = start_time;
|
2018-03-19 08:36:19 +08:00
|
|
|
ddip->bi_end_io = bio->bi_end_io;
|
|
|
|
ddip->bi_private = bio->bi_private;
|
|
|
|
bio->bi_end_io = detached_dev_end_io;
|
|
|
|
bio->bi_private = ddip;
|
|
|
|
|
|
|
|
if ((bio_op(bio) == REQ_OP_DISCARD) &&
|
2022-04-15 12:52:55 +08:00
|
|
|
!bdev_max_discard_sectors(dc->bdev))
|
2018-03-19 08:36:19 +08:00
|
|
|
bio->bi_end_io(bio);
|
|
|
|
else
|
2020-07-01 16:59:44 +08:00
|
|
|
submit_bio_noacct(bio);
|
2018-03-19 08:36:19 +08:00
|
|
|
}
|
|
|
|
|
2018-08-09 15:48:49 +08:00
|
|
|
static void quit_max_writeback_rate(struct cache_set *c,
|
|
|
|
struct cached_dev *this_dc)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
struct bcache_device *d;
|
|
|
|
struct cached_dev *dc;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* mutex bch_register_lock may compete with other parallel requesters,
|
|
|
|
* or attach/detach operations on other backing devices. Waiting for
|
|
|
|
* the mutex lock may increase I/O request latency for seconds or more.
|
|
|
|
* To avoid such a situation, if mutex_trylock() fails, only the writeback
|
|
|
|
* rate of the current cached device is set to 1, and __update_writeback_rate()
|
|
|
|
* will decide writeback rate of other cached devices (remember now
|
|
|
|
* c->idle_counter is 0 already).
|
|
|
|
*/
|
|
|
|
if (mutex_trylock(&bch_register_lock)) {
|
|
|
|
for (i = 0; i < c->devices_max_used; i++) {
|
|
|
|
if (!c->devices[i])
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (UUID_FLASH_ONLY(&c->uuids[i]))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
d = c->devices[i];
|
|
|
|
dc = container_of(d, struct cached_dev, disk);
|
|
|
|
/*
|
|
|
|
* set writeback rate to default minimum value,
|
|
|
|
* then let update_writeback_rate() decide the
|
|
|
|
* upcoming rate.
|
|
|
|
*/
|
|
|
|
atomic_long_set(&dc->writeback_rate.rate, 1);
|
|
|
|
}
|
|
|
|
mutex_unlock(&bch_register_lock);
|
|
|
|
} else
|
|
|
|
atomic_long_set(&this_dc->writeback_rate.rate, 1);
|
|
|
|
}
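The trylock fallback explained in the comment of quit_max_writeback_rate() can be sketched in user-space C. This is a simplified illustration, not kernel code: a C11 `atomic_flag` stands in for the `bch_register_lock` mutex, and `struct toy_dev`, `toy_trylock()`, `toy_unlock()` and `toy_quit_max_rate()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cached device's writeback rate. */
struct toy_dev { long rate; };

/* A simple try-lock built on atomic_flag; it never blocks the caller. */
static atomic_flag toy_register_lock = ATOMIC_FLAG_INIT;

static bool toy_trylock(void)
{
	return !atomic_flag_test_and_set(&toy_register_lock);
}

static void toy_unlock(void)
{
	atomic_flag_clear(&toy_register_lock);
}

/* Reset every device when the lock is free; otherwise only the caller's. */
static void toy_quit_max_rate(struct toy_dev **devs, size_t n,
			      struct toy_dev *self)
{
	if (toy_trylock()) {
		for (size_t i = 0; i < n; i++)
			if (devs[i])	/* skip empty slots */
				devs[i]->rate = 1;
		toy_unlock();
	} else {
		self->rate = 1;
	}
}
```

The point of the fallback is latency: the I/O path never waits for the lock, and the devices it could not touch are left for the periodic rate update to adjust.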
|
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
/* Cached devices - read & write stuff */
|
|
|
|
|
2021-10-12 19:12:24 +08:00
|
|
|
void cached_dev_submit_bio(struct bio *bio)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
struct search *s;
|
2021-01-24 18:02:37 +08:00
|
|
|
struct block_device *orig_bdev = bio->bi_bdev;
|
|
|
|
struct bcache_device *d = orig_bdev->bd_disk->private_data;
|
2013-03-24 07:11:31 +08:00
|
|
|
struct cached_dev *dc = container_of(d, struct cached_dev, disk);
|
2021-01-24 18:02:37 +08:00
|
|
|
unsigned long start_time;
|
2014-11-24 11:05:24 +08:00
|
|
|
int rw = bio_data_dir(bio);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2018-03-19 08:36:25 +08:00
|
|
|
if (unlikely((d->c && test_bit(CACHE_SET_IO_DISABLE, &d->c->flags)) ||
|
|
|
|
dc->io_disable)) {
|
2018-03-19 08:36:17 +08:00
|
|
|
bio->bi_status = BLK_STS_IOERR;
|
|
|
|
bio_endio(bio);
|
2021-10-12 19:12:24 +08:00
|
|
|
return;
|
2018-03-19 08:36:17 +08:00
	}

2018-08-09 15:48:49 +08:00
	if (likely(d->c)) {
		if (atomic_read(&d->c->idle_counter))
			atomic_set(&d->c->idle_counter, 0);
		/*
		 * If at_max_writeback_rate of cache set is true and new I/O
		 * comes, quit max writeback rate of all cached devices
		 * attached to this cache set, and set at_max_writeback_rate
		 * to false.
		 */
		if (unlikely(atomic_read(&d->c->at_max_writeback_rate) == 1)) {
			atomic_set(&d->c->at_max_writeback_rate, 0);
			quit_max_writeback_rate(d->c, dc);
		}
	}

2021-01-24 18:02:37 +08:00
	start_time = bio_start_io_acct(bio);

2017-08-24 01:10:32 +08:00
	bio_set_dev(bio, dc->bdev);
2013-10-12 06:44:27 +08:00
	bio->bi_iter.bi_sector += dc->sb.data_offset;
2013-03-24 07:11:31 +08:00

	if (cached_dev_get(dc)) {
2021-01-24 18:02:37 +08:00
		s = search_alloc(bio, d, orig_bdev, start_time);
2013-09-11 10:02:45 +08:00
		trace_bcache_request_start(s->d, bio);
2013-03-24 07:11:31 +08:00

2013-10-12 06:44:27 +08:00
		if (!bio->bi_iter.bi_size) {
2013-10-25 08:07:04 +08:00
			/*
			 * can't call bch_journal_meta from under
2020-07-01 16:59:44 +08:00
			 * submit_bio_noacct
2013-10-25 08:07:04 +08:00
			 */
			continue_at_nobarrier(&s->cl,
					      cached_dev_nodata,
					      bcache_wq);
		} else {
2013-09-11 10:02:45 +08:00
			s->iop.bypass = check_should_bypass(dc, bio);
2013-07-25 08:24:52 +08:00

			if (rw)
2013-09-11 08:06:17 +08:00
				cached_dev_write(dc, s);
2013-07-25 08:24:52 +08:00
			else
2013-09-11 08:06:17 +08:00
				cached_dev_read(dc, s);
2013-07-25 08:24:52 +08:00
		}
2018-03-19 08:36:19 +08:00
	} else
2018-03-19 08:36:24 +08:00
		/* I/O request sent to backing device */
2021-01-24 18:02:37 +08:00
		detached_dev_do_request(d, bio, orig_bdev, start_time);
2013-03-24 07:11:31 +08:00
}

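The idle-counter and max-writeback-rate handling at the top of the submit path can be modelled in user space. The following is a minimal sketch, not the kernel code: `struct cache_set_model` and `note_new_io()` are hypothetical names, C11 `<stdatomic.h>` stands in for the kernel's `atomic_t` helpers, and the boolean return value marks the point where the real code calls `quit_max_writeback_rate()`.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two per-cache-set counters used above. */
struct cache_set_model {
	atomic_int idle_counter;
	atomic_int at_max_writeback_rate;
};

/*
 * Called once per incoming request. Returns true when this request is the
 * one that ends max-writeback-rate mode, i.e. where the kernel code would
 * call quit_max_writeback_rate().
 */
static bool note_new_io(struct cache_set_model *c)
{
	/* Any new I/O means the cache set is no longer idle. */
	if (atomic_load(&c->idle_counter))
		atomic_store(&c->idle_counter, 0);

	/* Leave max-writeback-rate mode on the first request that sees it. */
	if (atomic_load(&c->at_max_writeback_rate) == 1) {
		atomic_store(&c->at_max_writeback_rate, 0);
		return true;
	}
	return false;
}
```

As in the original, the read and the reset are two separate atomic operations, so the sketch shares the property that two concurrent requests may both observe the flag as set.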
static int cached_dev_ioctl(struct bcache_device *d, fmode_t mode,
			    unsigned int cmd, unsigned long arg)
{
	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
2018-08-11 13:19:45 +08:00

2018-10-08 20:41:10 +08:00
	if (dc->io_disable)
		return -EIO;
2020-11-03 18:00:18 +08:00
	if (!dc->bdev->bd_disk->fops->ioctl)
		return -ENOTTY;
	return dc->bdev->bd_disk->fops->ioctl(dc->bdev, mode, cmd, arg);
2013-03-24 07:11:31 +08:00
}

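The ioctl path for a cached device is a plain forwarder guarded by two checks. A small user-space sketch of that ordering follows; `struct backing_model`, `forward_ioctl()` and `echo_cmd()` are hypothetical names introduced only for illustration, and the function-pointer field stands in for `dc->bdev->bd_disk->fops->ioctl`.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical handler type standing in for the backing driver's ioctl. */
typedef int (*ioctl_fn)(unsigned int cmd, unsigned long arg);

struct backing_model {
	bool io_disable;  /* models dc->io_disable */
	ioctl_fn ioctl;   /* NULL when the backing driver has no ioctl handler */
};

/* Same check order as above: disabled device first, missing handler second. */
static int forward_ioctl(const struct backing_model *b,
			 unsigned int cmd, unsigned long arg)
{
	if (b->io_disable)
		return -EIO;
	if (!b->ioctl)
		return -ENOTTY;
	return b->ioctl(cmd, arg);
}

/* Trivial handler used to show that the call really is forwarded. */
static int echo_cmd(unsigned int cmd, unsigned long arg)
{
	(void)arg;
	return (int)cmd;
}
```

Checking `io_disable` first means a disabled device reports `-EIO` even when the backing driver has no handler at all.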
void bch_cached_dev_request_init(struct cached_dev *dc)
{
	dc->disk.cache_miss = cached_dev_cache_miss;
	dc->disk.ioctl = cached_dev_ioctl;
}

/* Flash backed devices */

static int flash_dev_cache_miss(struct btree *b, struct search *s,
2018-08-11 13:19:44 +08:00
				struct bio *bio, unsigned int sectors)
2013-03-24 07:11:31 +08:00
{
2018-08-11 13:19:44 +08:00
	unsigned int bytes = min(sectors, bio_sectors(bio)) << 9;
2013-03-24 07:11:31 +08:00

2014-01-17 07:04:18 +08:00
	swap(bio->bi_iter.bi_size, bytes);
	zero_fill_bio(bio);
	swap(bio->bi_iter.bi_size, bytes);
2013-03-24 07:11:31 +08:00

2014-01-17 07:04:18 +08:00
	bio_advance(bio, bytes);
2013-06-07 09:15:57 +08:00

2013-10-12 06:44:27 +08:00
	if (!bio->bi_iter.bi_size)
2013-07-25 08:41:08 +08:00
		return MAP_DONE;
2013-03-24 07:11:31 +08:00

2013-07-25 08:41:08 +08:00
	return MAP_CONTINUE;
2013-03-24 07:11:31 +08:00
}

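The cache-miss handler for flash-only volumes zero-fills the missed range instead of reading from a backing device. The sketch below models that arithmetic in user space under simplifying assumptions: `struct bio_model` is a hypothetical flat buffer rather than a real `struct bio`, `cache_miss_zero_fill()` is an invented name, and the swap/zero/swap sequence is collapsed into a single `memset()` over the clamped byte count.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a bio: a flat buffer plus a cursor. */
struct bio_model {
	unsigned char *data;  /* current position in the buffer */
	size_t size;          /* bytes still to process, like bi_iter.bi_size */
};

enum { MAP_DONE_MODEL, MAP_CONTINUE_MODEL };

/* Zero at most `sectors` 512-byte sectors, then advance past them. */
static int cache_miss_zero_fill(struct bio_model *bio, unsigned int sectors)
{
	size_t remaining = bio->size >> 9;  /* like bio_sectors() */
	size_t n = sectors < remaining ? sectors : remaining;
	size_t bytes = n << 9;              /* sectors to bytes */

	memset(bio->data, 0, bytes);        /* like zero_fill_bio() on the clamped range */
	bio->data += bytes;                 /* like bio_advance() */
	bio->size -= bytes;

	return bio->size ? MAP_CONTINUE_MODEL : MAP_DONE_MODEL;
}
```

Clamping to the remaining size is what keeps the zero-fill inside the request when the missed extent is longer than what is left of the bio.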
2013-10-25 08:07:04 +08:00
static void flash_dev_nodata(struct closure *cl)
{
	struct search *s = container_of(cl, struct search, cl);

2013-09-11 10:02:45 +08:00
	if (s->iop.flush_journal)
		bch_journal_meta(s->iop.c, cl);
2013-10-25 08:07:04 +08:00

	continue_at(cl, search_free, NULL);
}

2021-10-12 19:12:24 +08:00
void flash_dev_submit_bio(struct bio *bio)
2013-03-24 07:11:31 +08:00
{
	struct search *s;
	struct closure *cl;
2021-01-24 18:02:34 +08:00
	struct bcache_device *d = bio->bi_bdev->bd_disk->private_data;
2013-03-24 07:11:31 +08:00

2018-03-19 08:36:17 +08:00
	if (unlikely(d->c && test_bit(CACHE_SET_IO_DISABLE, &d->c->flags))) {
		bio->bi_status = BLK_STS_IOERR;
		bio_endio(bio);
2021-10-12 19:12:24 +08:00
		return;
2018-03-19 08:36:17 +08:00
	}

2021-01-24 18:02:37 +08:00
	s = search_alloc(bio, d, bio->bi_bdev, bio_start_io_acct(bio));
2013-03-24 07:11:31 +08:00
	cl = &s->cl;
	bio = &s->bio.bio;

2013-09-11 10:02:45 +08:00
	trace_bcache_request_start(s->d, bio);
2013-03-24 07:11:31 +08:00

2013-10-12 06:44:27 +08:00
	if (!bio->bi_iter.bi_size) {
2013-10-25 08:07:04 +08:00
		/*
2020-07-01 16:59:44 +08:00
		 * can't call bch_journal_meta from under submit_bio_noacct
2013-10-25 08:07:04 +08:00
		 */
		continue_at_nobarrier(&s->cl,
				      flash_dev_nodata,
				      bcache_wq);
2021-10-12 19:12:24 +08:00
		return;
2018-07-18 19:47:39 +08:00
	} else if (bio_data_dir(bio)) {
2013-09-11 10:02:45 +08:00
		bch_keybuf_check_overlapping(&s->iop.c->moving_gc_keys,
2013-10-12 06:44:27 +08:00
					&KEY(d->id, bio->bi_iter.bi_sector, 0),
2013-06-07 09:15:57 +08:00
					&KEY(d->id, bio_end_sector(bio), 0));
2013-03-24 07:11:31 +08:00

2016-06-06 03:32:05 +08:00
		s->iop.bypass = (bio_op(bio) == REQ_OP_DISCARD) != 0;
2013-09-11 10:02:45 +08:00
		s->iop.writeback = true;
		s->iop.bio = bio;
2013-03-24 07:11:31 +08:00

2013-09-11 10:02:45 +08:00
		closure_call(&s->iop.cl, bch_data_insert, NULL, cl);
2013-03-24 07:11:31 +08:00
	} else {
2013-09-11 10:02:45 +08:00
		closure_call(&s->iop.cl, cache_lookup, NULL, cl);
2013-03-24 07:11:31 +08:00
	}

	continue_at(cl, search_free, NULL);
}

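The early return at the top of the flash submit path rejects a request before any search state is allocated. A user-space sketch of that gate follows. All names ending in `_MODEL` or `_model`, as well as `submit_gate()`, are hypothetical; the bit index 3 is taken from the "v3" changelog entry of the commit message, and the real `test_bit()` is replaced by a plain shift-and-mask.

```c
#include <stdbool.h>

/* Bit index, not a mask; the value 3 follows the v3 changelog entry. */
#define CACHE_SET_IO_DISABLE_MODEL 3

/* Simplified single-word replacement for the kernel's test_bit(). */
static bool test_bit_model(unsigned int nr, const unsigned long *flags)
{
	return (*flags >> nr) & 1UL;
}

enum submit_result { SUBMIT_OK, SUBMIT_IOERR };

/*
 * has_cache_set models the "d->c" NULL check: the flags word is only
 * consulted when the device is attached to a cache set.
 */
static enum submit_result submit_gate(bool has_cache_set,
				      const unsigned long *flags)
{
	if (has_cache_set && test_bit_model(CACHE_SET_IO_DISABLE_MODEL, flags))
		return SUBMIT_IOERR;  /* real code: BLK_STS_IOERR + bio_endio() */
	return SUBMIT_OK;
}
```

Because the constant is a bit index, setting it in a flags word means `flags |= 1UL << CACHE_SET_IO_DISABLE_MODEL`, not `flags |= CACHE_SET_IO_DISABLE_MODEL`.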
static int flash_dev_ioctl(struct bcache_device *d, fmode_t mode,
			   unsigned int cmd, unsigned long arg)
{
	return -ENOTTY;
}

void bch_flash_dev_request_init(struct bcache_device *d)
{
	d->cache_miss = flash_dev_cache_miss;
	d->ioctl = flash_dev_ioctl;
}

void bch_request_exit(void)
{
2018-08-11 13:19:54 +08:00
	kmem_cache_destroy(bch_search_cache);
2013-03-24 07:11:31 +08:00
}

int __init bch_request_init(void)
{
	bch_search_cache = KMEM_CACHE(search, 0);
	if (!bch_search_cache)
		return -ENOMEM;

	return 0;
}