License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0                                               11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note", otherwise it was "GPL-2.0". The results of that were:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                         930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                            # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note                        270
GPL-2.0+ WITH Linux-syscall-note                       169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)     21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)     17
LGPL-2.1+ WITH Linux-syscall-note                       15
GPL-1.0+ WITH Linux-syscall-note                        14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)     5
LGPL-2.0+ WITH Linux-syscall-note                        4
LGPL-2.1 WITH Linux-syscall-note                         3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)               3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)              1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
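The decision procedure in the list above can be sketched in C. This is an illustrative sketch only, not the spreadsheet workflow the reviewers actually used: the `verdict` enum and the `conclude()`, `is_uapi_path()` and `default_license()` helpers are hypothetical names, and a scanner result is modelled as a license string, with NULL meaning that nothing was detected. Matching `*/uapi/*` with a substring search for "/uapi/" is also a simplifying assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical outcome of comparing the two scanner results for one file. */
enum verdict {
	VERDICT_DEFAULT,	/* neither scanner found anything: COPYING applies */
	VERDICT_AGREED,		/* both scanners reported the same license */
	VERDICT_MANUAL_REVIEW,	/* scanners disagree: a person inspects the file */
};

/* a and b are the license ids reported by the two scanners, NULL if none. */
static enum verdict conclude(const char *a, const char *b)
{
	if (!a && !b)
		return VERDICT_DEFAULT;
	if (a && b && strcmp(a, b) == 0)
		return VERDICT_AGREED;
	return VERDICT_MANUAL_REVIEW;
}

/* Assumption: a path counts as uapi when it contains a "/uapi/" component. */
static bool is_uapi_path(const char *path)
{
	assert(path != NULL);
	return strstr(path, "/uapi/") != NULL;
}

/* Identifier applied when no license information was found in the file. */
static const char *default_license(const char *path)
{
	return is_uapi_path(path) ? "GPL-2.0 WITH Linux-syscall-note"
				  : "GPL-2.0";
}
```

Under these assumptions, `conclude(NULL, NULL)` yields `VERDICT_DEFAULT`, and `default_license()` then picks between the two identifiers from the first two tables. Any disagreement between the scanners goes to manual review rather than being resolved automatically.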
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types). Finally Greg ran the script using the .csv files to
generate the patches.
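As a rough illustration of the header versus source distinction mentioned above, the following C sketch formats the SPDX line for a given file name. It is not the script that Thomas and Greg wrote; `has_suffix()` and `spdx_line()` are hypothetical helpers, and only `.c` and `.h` files are handled. The comment styles follow the kernel convention of a `//` comment in C source files and a `/* */` comment in headers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if name ends with suffix, 0 otherwise. */
static int has_suffix(const char *name, const char *suffix)
{
	size_t n = strlen(name), s = strlen(suffix);

	return n >= s && strcmp(name + n - s, suffix) == 0;
}

/*
 * Format the SPDX line for filename into buf.
 * Returns the number of characters written (excluding the NUL), or -1 if
 * the file type is not handled or the buffer is too small.
 */
static int spdx_line(const char *filename, const char *license,
		     char *buf, size_t size)
{
	int len;

	assert(filename && license && buf);

	if (has_suffix(filename, ".c"))
		len = snprintf(buf, size,
			       "// SPDX-License-Identifier: %s", license);
	else if (has_suffix(filename, ".h"))
		len = snprintf(buf, size,
			       "/* SPDX-License-Identifier: %s */", license);
	else
		return -1;	/* other file types are out of scope here */

	return (len < 0 || (size_t)len >= size) ? -1 : len;
}
```

For a `.c` file and the license "GPL-2.0" this produces the same first line that this file carries. A real tool would also need to cover assembly, Makefiles and scripts, which use other comment syntaxes.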
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
// SPDX-License-Identifier: GPL-2.0
/*
 * Copyright (C) 2010 Kent Overstreet <kent.overstreet@gmail.com>
 *
 * Uses a block device as cache for other block devices; optimized for SSDs.
 * All allocation is done in buckets, which should match the erase block size
 * of the device.
 *
 * Buckets containing cached data are kept on a heap sorted by priority;
 * bucket priority is increased on cache hit, and periodically all the buckets
 * on the heap have their priority scaled down. This currently is just used as
 * an LRU but in the future should allow for more intelligent heuristics.
 *
 * Buckets have an 8 bit counter; freeing is accomplished by incrementing the
 * counter. Garbage collection is used to remove stale pointers.
 *
 * Indexing is done via a btree; nodes are not necessarily fully sorted, rather
 * as keys are inserted we only sort the pages that have not yet been written.
 * When garbage collection is run, we resort the entire node.
 *
 * All configuration is done via sysfs; see Documentation/bcache.txt.
 */

#include "bcache.h"
#include "btree.h"
#include "debug.h"
#include "extents.h"

#include <linux/slab.h>
#include <linux/bitops.h>
#include <linux/hash.h>
#include <linux/kthread.h>
#include <linux/prefetch.h>
#include <linux/random.h>
#include <linux/rcupdate.h>
#include <linux/sched/clock.h>
#include <linux/rculist.h>

#include <trace/events/bcache.h>

/*
 * Todo:
 * register_bcache: Return errors out to userspace correctly
 *
 * Writeback: don't undirty key until after a cache flush
 *
 * Create an iterator for key pointers
 *
 * On btree write error, mark bucket such that it won't be freed from the cache
 *
 * Journalling:
 *   Check for bad keys in replay
 *   Propagate barriers
 *   Refcount journal entries in journal_replay
 *
 * Garbage collection:
 *   Finish incremental gc
 *   Gc should free old UUIDs, data for invalid UUIDs
 *
 * Provide a way to list backing device UUIDs we have data cached for, and
 * probably how long it's been since we've seen them, and a way to invalidate
 * dirty data for devices that will never be attached again
 *
 * Keep 1 min/5 min/15 min statistics of how busy a block device has been, so
 * that based on that and how much dirty data we have we can keep writeback
 * from being starved
 *
 * Add a tracepoint or somesuch to watch for writeback starvation
 *
 * When btree depth > 1 and splitting an interior node, we have to make sure
 * alloc_bucket() cannot fail. This should be true but is not completely
 * obvious.
 *
 * Plugging?
 *
 * If data write is less than hard sector size of ssd, round up offset in open
 * bucket to the next whole sector
 *
 * Superblock needs to be fleshed out for multiple cache devices
 *
 * Add a sysfs tunable for the number of writeback IOs in flight
 *
 * Add a sysfs tunable for the number of open data buckets
 *
 * IO tracking: Can we track when one process is doing io on behalf of another?
 * IO tracking: Don't use just an average, weigh more recent stuff higher
 *
 * Test module load/unload
 */

#define MAX_NEED_GC 64
#define MAX_SAVE_PRIO 72

#define PTR_DIRTY_BIT (((uint64_t) 1 << 36))

#define PTR_HASH(c, k) \
	(((k)->ptr[0] >> c->bucket_bits) | PTR_GEN(k, 0))

#define insert_lock(s, b) ((b)->level <= (s)->lock)

/*
 * These macros are for recursing down the btree - they handle the details of
 * locking and looking up nodes in the cache for you. They're best treated as
 * mere syntax when reading code that uses them.
 *
 * op->lock determines whether we take a read or a write lock at a given depth.
 * If you've got a read lock and find that you need a write lock (i.e. you're
 * going to have to split), set op->lock and return -EINTR; btree_root() will
 * call you again and you'll have the correct lock.
 */

/**
 * btree - recurse down the btree on a specified key
 * @fn:		function to call, which will be passed the child node
 * @key:	key to recurse on
 * @b:		parent btree node
 * @op:		pointer to struct btree_op
 */
#define btree(fn, key, b, op, ...) \
({ \
	int _r, l = (b)->level - 1; \
	bool _w = l <= (op)->lock; \
	struct btree *_child = bch_btree_node_get((b)->c, op, key, l, \
						  _w, b); \
	if (!IS_ERR(_child)) { \
		_r = bch_btree_ ## fn(_child, op, ##__VA_ARGS__); \
		rw_unlock(_w, _child); \
	} else \
		_r = PTR_ERR(_child); \
	_r; \
})

/**
 * btree_root - call a function on the root of the btree
 * @fn:		function to call, which will be passed the child node
 * @c:		cache set
 * @op:		pointer to struct btree_op
 */
#define btree_root(fn, c, op, ...) \
({ \
	int _r = -EINTR; \
	do { \
		struct btree *_b = (c)->root; \
		bool _w = insert_lock(op, _b); \
		rw_lock(_w, _b, _b->level); \
		if (_b == (c)->root && \
		    _w == insert_lock(op, _b)) { \
			_r = bch_btree_ ## fn(_b, op, ##__VA_ARGS__); \
		} \
		rw_unlock(_w, _b); \
		bch_cannibalize_unlock(c); \
		if (_r == -EINTR) \
			schedule(); \
	} while (_r == -EINTR); \
	\
	finish_wait(&(c)->btree_cache_wait, &(op)->wait); \
	_r; \
})

static inline struct bset *write_block(struct btree *b)
{
	return ((void *) btree_bset_first(b)) + b->written * block_bytes(b->c);
}

static void bch_btree_init_next(struct btree *b)
{
	/* If not a leaf node, always sort */
	if (b->level && b->keys.nsets)
		bch_btree_sort(&b->keys, &b->c->sort);
	else
		bch_btree_sort_lazy(&b->keys, &b->c->sort);

	if (b->written < btree_blocks(b))
		bch_bset_init_next(&b->keys, write_block(b),
				   bset_magic(&b->c->sb));

}

/* Btree key manipulation */

void bkey_put(struct cache_set *c, struct bkey *k)
{
	unsigned i;

	for (i = 0; i < KEY_PTRS(k); i++)
		if (ptr_available(c, k, i))
			atomic_dec_bug(&PTR_BUCKET(c, k, i)->pin);
}

/* Btree IO */

static uint64_t btree_csum_set(struct btree *b, struct bset *i)
{
	uint64_t crc = b->key.ptr[0];
	void *data = (void *) i + 8, *end = bset_bkey_last(i);

	crc = bch_crc64_update(crc, data, end - data);
	return crc ^ 0xffffffffffffffffULL;
}

void bch_btree_node_read_done(struct btree *b)
{
	const char *err = "bad btree header";
	struct bset *i = btree_bset_first(b);
	struct btree_iter *iter;

	iter = mempool_alloc(b->c->fill_iter, GFP_NOIO);
	iter->size = b->c->sb.bucket_size / b->c->sb.block_size;
	iter->used = 0;

#ifdef CONFIG_BCACHE_DEBUG
	iter->b = &b->keys;
#endif

	if (!i->seq)
		goto err;

	for (;
	     b->written < btree_blocks(b) && i->seq == b->keys.set[0].data->seq;
	     i = write_block(b)) {
		err = "unsupported bset version";
		if (i->version > BCACHE_BSET_VERSION)
			goto err;

		err = "bad btree header";
		if (b->written + set_blocks(i, block_bytes(b->c)) >
		    btree_blocks(b))
			goto err;

		err = "bad magic";
		if (i->magic != bset_magic(&b->c->sb))
			goto err;

		err = "bad checksum";
		switch (i->version) {
		case 0:
			if (i->csum != csum_set(i))
				goto err;
			break;
		case BCACHE_BSET_VERSION:
			if (i->csum != btree_csum_set(b, i))
				goto err;
			break;
		}

		err = "empty set";
		if (i != b->keys.set[0].data && !i->keys)
			goto err;

		bch_btree_iter_push(iter, i->start, bset_bkey_last(i));

		b->written += set_blocks(i, block_bytes(b->c));
	}

	err = "corrupted btree";
	for (i = write_block(b);
	     bset_sector_offset(&b->keys, i) < KEY_SIZE(&b->key);
	     i = ((void *) i) + block_bytes(b->c))
		if (i->seq == b->keys.set[0].data->seq)
			goto err;

	bch_btree_sort_and_fix_extents(&b->keys, iter, &b->c->sort);

	i = b->keys.set[0].data;
	err = "short btree key";
	if (b->keys.set[0].size &&
	    bkey_cmp(&b->key, &b->keys.set[0].end) < 0)
		goto err;

	if (b->written < btree_blocks(b))
		bch_bset_init_next(&b->keys, write_block(b),
				   bset_magic(&b->c->sb));
out:
	mempool_free(iter, b->c->fill_iter);
	return;
err:
	set_btree_node_io_error(b);
	bch_cache_set_error(b->c, "%s at bucket %zu, block %u, %u keys",
			    err, PTR_BUCKET_NR(b->c, &b->key, 0),
			    bset_block_offset(b, i), i->keys);
	goto out;
}

static void btree_node_read_endio(struct bio *bio)
{
	struct closure *cl = bio->bi_private;
	closure_put(cl);
}

static void bch_btree_node_read(struct btree *b)
{
	uint64_t start_time = local_clock();
	struct closure cl;
	struct bio *bio;

	trace_bcache_btree_read(b);

	closure_init_stack(&cl);

	bio = bch_bbio_alloc(b->c);
	bio->bi_iter.bi_size = KEY_SIZE(&b->key) << 9;
	bio->bi_end_io = btree_node_read_endio;
	bio->bi_private = &cl;
	bio->bi_opf = REQ_OP_READ | REQ_META;

	bch_bio_map(bio, b->keys.set[0].data);

	bch_submit_bbio(bio, b->c, &b->key, 0);
	closure_sync(&cl);

	if (bio->bi_status)
		set_btree_node_io_error(b);

	bch_bbio_free(bio, b->c);

	if (btree_node_io_error(b))
		goto err;

	bch_btree_node_read_done(b);
	bch_time_stats_update(&b->c->btree_read_time, start_time);

	return;
err:
	bch_cache_set_error(b->c, "io error reading bucket %zu",
			    PTR_BUCKET_NR(b->c, &b->key, 0));
}

static void btree_complete_write(struct btree *b, struct btree_write *w)
{
	if (w->prio_blocked &&
	    !atomic_sub_return(w->prio_blocked, &b->c->prio_blocked))
		wake_up_allocators(b->c);

	if (w->journal) {
		atomic_dec_bug(w->journal);
		__closure_wake_up(&b->c->journal.wait);
	}

	w->prio_blocked = 0;
	w->journal = NULL;
}

static void btree_node_write_unlock(struct closure *cl)
{
	struct btree *b = container_of(cl, struct btree, io);

	up(&b->io_mutex);
}

static void __btree_node_write_done(struct closure *cl)
{
	struct btree *b = container_of(cl, struct btree, io);
	struct btree_write *w = btree_prev_write(b);

	bch_bbio_free(b->bio, b->c);
	b->bio = NULL;
	btree_complete_write(b, w);

	if (btree_node_dirty(b))
		schedule_delayed_work(&b->work, 30 * HZ);

	closure_return_with_destructor(cl, btree_node_write_unlock);
}

static void btree_node_write_done(struct closure *cl)
{
	struct btree *b = container_of(cl, struct btree, io);

	bio_free_pages(b->bio);
	__btree_node_write_done(cl);
}

static void btree_node_write_endio(struct bio *bio)
{
	struct closure *cl = bio->bi_private;
	struct btree *b = container_of(cl, struct btree, io);

	if (bio->bi_status)
		set_btree_node_io_error(b);

	bch_bbio_count_io_errors(b->c, bio, bio->bi_status, "writing btree");
	closure_put(cl);
}

static void do_btree_node_write(struct btree *b)
{
	struct closure *cl = &b->io;
	struct bset *i = btree_bset_last(b);
	BKEY_PADDED(key) k;

	i->version = BCACHE_BSET_VERSION;
	i->csum = btree_csum_set(b, i);

	BUG_ON(b->bio);
	b->bio = bch_bbio_alloc(b->c);

	b->bio->bi_end_io = btree_node_write_endio;
	b->bio->bi_private = cl;
	b->bio->bi_iter.bi_size = roundup(set_bytes(i), block_bytes(b->c));
	b->bio->bi_opf = REQ_OP_WRITE | REQ_META | REQ_FUA;
	bch_bio_map(b->bio, i);

	/*
	 * If we're appending to a leaf node, we don't technically need FUA -
	 * this write just needs to be persisted before the next journal write,
	 * which will be marked FLUSH|FUA.
	 *
	 * Similarly if we're writing a new btree root - the pointer is going to
	 * be in the next journal entry.
	 *
	 * But if we're writing a new btree node (that isn't a root) or
	 * appending to a non leaf btree node, we need either FUA or a flush
	 * when we write the parent with the new pointer. FUA is cheaper than a
	 * flush, and writes appending to leaf nodes aren't blocking anything so
	 * just make all btree node writes FUA to keep things sane.
	 */

	bkey_copy(&k.key, &b->key);
	SET_PTR_OFFSET(&k.key, 0, PTR_OFFSET(&k.key, 0) +
		       bset_sector_offset(&b->keys, i));

	if (!bch_bio_alloc_pages(b->bio, __GFP_NOWARN|GFP_NOWAIT)) {
		int j;
		struct bio_vec *bv;
		void *base = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));

		bio_for_each_segment_all(bv, b->bio, j)
			memcpy(page_address(bv->bv_page),
			       base + j * PAGE_SIZE, PAGE_SIZE);

		bch_submit_bbio(b->bio, b->c, &k.key, 0);

		continue_at(cl, btree_node_write_done, NULL);
	} else {
		/* No problem for multipage bvec since the bio is just allocated */
		b->bio->bi_vcnt = 0;
		bch_bio_map(b->bio, i);

		bch_submit_bbio(b->bio, b->c, &k.key, 0);

		closure_sync(cl);
		continue_at_nobarrier(cl, __btree_node_write_done, NULL);
	}
}

void __bch_btree_node_write(struct btree *b, struct closure *parent)
{
	struct bset *i = btree_bset_last(b);

	lockdep_assert_held(&b->write_lock);

	trace_bcache_btree_write(b);

	BUG_ON(current->bio_list);
	BUG_ON(b->written >= btree_blocks(b));
	BUG_ON(b->written && !i->keys);
	BUG_ON(btree_bset_first(b)->seq != i->seq);
	bch_check_keys(&b->keys, "writing");

	cancel_delayed_work(&b->work);

	/* If caller isn't waiting for write, parent refcount is cache set */
	down(&b->io_mutex);
	closure_init(&b->io, parent ?: &b->c->cl);

	clear_bit(BTREE_NODE_dirty, &b->flags);
	change_bit(BTREE_NODE_write_idx, &b->flags);

	do_btree_node_write(b);

	atomic_long_add(set_blocks(i, block_bytes(b->c)) * b->c->sb.block_size,
			&PTR_CACHE(b->c, &b->key, 0)->btree_sectors_written);

	b->written += set_blocks(i, block_bytes(b->c));
}

void bch_btree_node_write(struct btree *b, struct closure *parent)
{
	unsigned nsets = b->keys.nsets;

	lockdep_assert_held(&b->lock);

	__bch_btree_node_write(b, parent);

	/*
	 * do verify if there was more than one set initially (i.e. we did a
	 * sort) and we sorted down to a single set:
	 */
	if (nsets && !b->keys.nsets)
		bch_btree_verify(b);

	bch_btree_init_next(b);
}

static void bch_btree_node_write_sync(struct btree *b)
{
	struct closure cl;

	closure_init_stack(&cl);

	mutex_lock(&b->write_lock);
	bch_btree_node_write(b, &cl);
	mutex_unlock(&b->write_lock);

	closure_sync(&cl);
}

static void btree_node_write_work(struct work_struct *w)
{
	struct btree *b = container_of(to_delayed_work(w), struct btree, work);

	mutex_lock(&b->write_lock);
	if (btree_node_dirty(b))
		__bch_btree_node_write(b, NULL);
	mutex_unlock(&b->write_lock);
}

static void bch_btree_leaf_dirty(struct btree *b, atomic_t *journal_ref)
{
	struct bset *i = btree_bset_last(b);
	struct btree_write *w = btree_current_write(b);

	lockdep_assert_held(&b->write_lock);

	BUG_ON(!b->written);
	BUG_ON(!i->keys);

	if (!btree_node_dirty(b))
		schedule_delayed_work(&b->work, 30 * HZ);

	set_btree_node_dirty(b);

	if (journal_ref) {
		if (w->journal &&
		    journal_pin_cmp(b->c, w->journal, journal_ref)) {
			atomic_dec_bug(w->journal);
			w->journal = NULL;
		}

		if (!w->journal) {
			w->journal = journal_ref;
			atomic_inc(w->journal);
		}
	}

	/* Force write if set is too big */
	if (set_bytes(i) > PAGE_SIZE - 48 &&
	    !current->bio_list)
		bch_btree_node_write(b, NULL);
}

/*
 * Btree in memory cache - allocation/freeing
 * mca -> memory cache
 */

#define mca_reserve(c) (((c->root && c->root->level) \
	? c->root->level : 1) * 8 + 16)
#define mca_can_free(c) \
	max_t(int, 0, c->btree_cache_used - mca_reserve(c))

static void mca_data_free(struct btree *b)
{
	BUG_ON(b->io_mutex.count != 1);

	bch_btree_keys_free(&b->keys);

	b->c->btree_cache_used--;
	list_move(&b->list, &b->c->btree_cache_freed);
}

static void mca_bucket_free(struct btree *b)
{
	BUG_ON(btree_node_dirty(b));

	b->key.ptr[0] = 0;
	hlist_del_init_rcu(&b->hash);
	list_move(&b->list, &b->c->btree_cache_freeable);
}

static unsigned btree_order(struct bkey *k)
{
	return ilog2(KEY_SIZE(k) / PAGE_SECTORS ?: 1);
}

static void mca_data_alloc(struct btree *b, struct bkey *k, gfp_t gfp)
{
	if (!bch_btree_keys_alloc(&b->keys,
				  max_t(unsigned,
					ilog2(b->c->btree_pages),
					btree_order(k)),
				  gfp)) {
		b->c->btree_cache_used++;
		list_move(&b->list, &b->c->btree_cache);
	} else {
		list_move(&b->list, &b->c->btree_cache_freed);
	}
}

static struct btree *mca_bucket_alloc(struct cache_set *c,
|
|
|
|
struct bkey *k, gfp_t gfp)
|
|
|
|
{
|
|
|
|
struct btree *b = kzalloc(sizeof(struct btree), gfp);
|
|
|
|
if (!b)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
init_rwsem(&b->lock);
|
|
|
|
lockdep_set_novalidate_class(&b->lock);
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_init(&b->write_lock);
|
|
|
|
lockdep_set_novalidate_class(&b->write_lock);
|
2013-03-24 07:11:31 +08:00
|
|
|
INIT_LIST_HEAD(&b->list);
|
2013-04-26 04:58:35 +08:00
|
|
|
INIT_DELAYED_WORK(&b->work, btree_node_write_work);
|
2013-03-24 07:11:31 +08:00
|
|
|
b->c = c;
|
2013-12-17 07:27:25 +08:00
|
|
|
sema_init(&b->io_mutex, 1);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
mca_data_alloc(b, k, gfp);
|
|
|
|
return b;
|
|
|
|
}
|
|
|
|
|
2013-07-25 08:27:07 +08:00
|
|
|
static int mca_reap(struct btree *b, unsigned min_order, bool flush)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2013-07-25 08:27:07 +08:00
|
|
|
struct closure cl;
|
|
|
|
|
|
|
|
closure_init_stack(&cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
lockdep_assert_held(&b->c->bucket_lock);
|
|
|
|
|
|
|
|
if (!down_write_trylock(&b->lock))
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2013-12-21 09:28:16 +08:00
|
|
|
BUG_ON(btree_node_dirty(b) && !b->keys.set[0].data);
|
2013-07-25 08:27:07 +08:00
|
|
|
|
2013-12-21 09:28:16 +08:00
|
|
|
if (b->keys.page_order < min_order)
|
2013-12-17 07:27:25 +08:00
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
if (!flush) {
|
|
|
|
if (btree_node_dirty(b))
|
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
if (down_trylock(&b->io_mutex))
|
|
|
|
goto out_unlock;
|
|
|
|
up(&b->io_mutex);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_lock(&b->write_lock);
|
2013-07-24 11:48:29 +08:00
|
|
|
if (btree_node_dirty(b))
|
2014-03-05 08:42:42 +08:00
|
|
|
__bch_btree_node_write(b, &cl);
|
|
|
|
mutex_unlock(&b->write_lock);
|
|
|
|
|
|
|
|
closure_sync(&cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 08:27:07 +08:00
|
|
|
/* wait for any in flight btree write */
|
2013-12-17 07:27:25 +08:00
|
|
|
down(&b->io_mutex);
|
|
|
|
up(&b->io_mutex);
|
2013-07-25 08:27:07 +08:00
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
return 0;
|
2013-12-17 07:27:25 +08:00
|
|
|
out_unlock:
|
|
|
|
rw_unlock(true, b);
|
|
|
|
return -ENOMEM;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
drivers: convert shrinkers to new count/scan API
Convert the driver shrinkers to the new API. Most changes are compile
tested only because I either don't have the hardware or it's staging
stuff.
FWIW, the md and android code is pretty good, but the rest of it makes me
want to claw my eyes out. The amount of broken code I just encountered is
mind boggling. I've added comments explaining what is broken, but I fear
that some of the code would be best dealt with by being dragged behind the
bike shed, buried in mud up to its neck and then run over repeatedly
with a blunt lawn mower.
Special mention goes to the zcache/zcache2 drivers. They can't co-exist
in the build at the same time, they are under different menu options in
menuconfig, they only show up when you've got the right set of mm
subsystem options configured and so even compile testing is an exercise in
pulling teeth. And that doesn't even take into account the horrible,
broken code...
[glommer@openvz.org: fixes for i915, android lowmem, zcache, bcache]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-28 08:18:11 +08:00
|
|
|
static unsigned long bch_mca_scan(struct shrinker *shrink,
|
|
|
|
struct shrink_control *sc)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
struct cache_set *c = container_of(shrink, struct cache_set, shrink);
|
|
|
|
struct btree *b, *t;
|
|
|
|
unsigned long i, nr = sc->nr_to_scan;
|
2013-08-28 08:18:11 +08:00
|
|
|
unsigned long freed = 0;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
if (c->shrinker_disabled)
|
2013-08-28 08:18:11 +08:00
|
|
|
return SHRINK_STOP;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
if (c->btree_cache_alloc_lock)
|
2013-08-28 08:18:11 +08:00
|
|
|
return SHRINK_STOP;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
/* Return SHRINK_STOP if we can't take the lock right now */
|
2013-09-24 14:17:34 +08:00
|
|
|
if (sc->gfp_mask & __GFP_IO)
|
2013-03-24 07:11:31 +08:00
|
|
|
mutex_lock(&c->bucket_lock);
|
|
|
|
else if (!mutex_trylock(&c->bucket_lock))
|
|
|
|
return SHRINK_STOP;
|
|
|
|
|
2013-06-04 04:04:56 +08:00
|
|
|
/*
|
|
|
|
* It's _really_ critical that we don't free too many btree nodes - we
|
|
|
|
* have to always leave ourselves a reserve. The reserve is how we
|
|
|
|
* guarantee that allocating memory for a new btree node can always
|
|
|
|
* succeed, so that inserting keys into the btree can always succeed and
|
|
|
|
* IO can always make forward progress:
|
|
|
|
*/
|
2013-03-24 07:11:31 +08:00
|
|
|
nr /= c->btree_pages;
|
|
|
|
nr = min_t(unsigned long, nr, mca_can_free(c));
|
|
|
|
|
|
|
|
i = 0;
|
|
|
|
list_for_each_entry_safe(b, t, &c->btree_cache_freeable, list) {
|
2013-08-28 08:18:11 +08:00
|
|
|
if (freed >= nr)
|
2013-03-24 07:11:31 +08:00
|
|
|
break;
|
|
|
|
|
|
|
|
if (++i > 3 &&
|
2013-07-25 08:27:07 +08:00
|
|
|
!mca_reap(b, 0, false)) {
|
2013-03-24 07:11:31 +08:00
|
|
|
mca_data_free(b);
|
|
|
|
rw_unlock(true, b);
|
2013-08-28 08:18:11 +08:00
|
|
|
freed++;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
for (i = 0; (nr--) && i < c->btree_cache_used; i++) {
|
2013-12-11 05:24:26 +08:00
|
|
|
if (list_empty(&c->btree_cache))
|
|
|
|
goto out;
|
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
b = list_first_entry(&c->btree_cache, struct btree, list);
|
|
|
|
list_rotate_left(&c->btree_cache);
|
|
|
|
|
|
|
|
if (!b->accessed &&
|
2013-07-25 08:27:07 +08:00
|
|
|
!mca_reap(b, 0, false)) {
|
2013-03-24 07:11:31 +08:00
|
|
|
mca_bucket_free(b);
|
|
|
|
mca_data_free(b);
|
|
|
|
rw_unlock(true, b);
|
2013-08-28 08:18:11 +08:00
|
|
|
freed++;
|
2013-03-24 07:11:31 +08:00
|
|
|
} else
|
|
|
|
b->accessed = 0;
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
mutex_unlock(&c->bucket_lock);
|
2018-03-19 08:36:21 +08:00
|
|
|
return freed * c->btree_pages;
|
2013-08-28 08:18:11 +08:00
|
|
|
}
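The second loop in bch_mca_scan() is a second-chance sweep: it takes the node at the head of the cache list, rotates the list, frees the node if its `accessed` flag is clear, and otherwise clears the flag so the node becomes a candidate on the next pass. The sketch below shows only that policy on a fixed array used as a ring. It leaves out locking, the `mca_reap()` check and the page/node unit conversion, and `struct node_sketch` and `sketch_sweep` are hypothetical names rather than bcache code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a cached btree node. */
struct node_sketch {
	int accessed; /* set when the node is used, cleared by the sweep */
	int cached;   /* 1 while the node still holds its memory */
};

/*
 * Visit up to `budget` slots starting at *hand. A cached node that was not
 * accessed since the previous visit is freed; an accessed node only loses
 * its flag. Returns the number of nodes freed.
 */
static size_t sketch_sweep(struct node_sketch *ring, size_t n,
			   size_t *hand, size_t budget)
{
	size_t freed = 0;

	assert(n > 0 && *hand < n);
	while (budget--) {
		struct node_sketch *b = &ring[*hand];

		/* Advancing the hand plays the role of list_rotate_left(). */
		*hand = (*hand + 1) % n;

		if (!b->cached)
			continue;

		if (!b->accessed) {
			b->cached = 0; /* stands in for mca_bucket_free() + mca_data_free() */
			freed++;
		} else {
			b->accessed = 0; /* second chance: survives this pass only */
		}
	}
	return freed;
}
```

A node therefore survives a sweep only if it was touched between two consecutive visits, which approximates LRU without reordering the list on every access.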
|
|
|
|
|
|
|
|
static unsigned long bch_mca_count(struct shrinker *shrink,
|
|
|
|
struct shrink_control *sc)
|
|
|
|
{
|
|
|
|
struct cache_set *c = container_of(shrink, struct cache_set, shrink);
|
|
|
|
|
|
|
|
if (c->shrinker_disabled)
|
|
|
|
return 0;
|
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
if (c->btree_cache_alloc_lock)
|
2013-08-28 08:18:11 +08:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
return mca_can_free(c) * c->btree_pages;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
void bch_btree_cache_free(struct cache_set *c)
|
|
|
|
{
|
|
|
|
struct btree *b;
|
|
|
|
struct closure cl;
|
|
|
|
closure_init_stack(&cl);
|
|
|
|
|
|
|
|
if (c->shrink.list.next)
|
|
|
|
unregister_shrinker(&c->shrink);
|
|
|
|
|
|
|
|
mutex_lock(&c->bucket_lock);
|
|
|
|
|
|
|
|
#ifdef CONFIG_BCACHE_DEBUG
|
|
|
|
if (c->verify_data)
|
|
|
|
list_move(&c->verify_data->list, &c->btree_cache);
|
2013-12-18 14:49:08 +08:00
|
|
|
|
|
|
|
free_pages((unsigned long) c->verify_ondisk, ilog2(bucket_pages(c)));
|
2013-03-24 07:11:31 +08:00
|
|
|
#endif
|
|
|
|
|
|
|
|
list_splice(&c->btree_cache_freeable,
|
|
|
|
&c->btree_cache);
|
|
|
|
|
|
|
|
while (!list_empty(&c->btree_cache)) {
|
|
|
|
b = list_first_entry(&c->btree_cache, struct btree, list);
|
|
|
|
|
|
|
|
if (btree_node_dirty(b))
|
|
|
|
btree_complete_write(b, btree_current_write(b));
|
|
|
|
clear_bit(BTREE_NODE_dirty, &b->flags);
|
|
|
|
|
|
|
|
mca_data_free(b);
|
|
|
|
}
|
|
|
|
|
|
|
|
while (!list_empty(&c->btree_cache_freed)) {
|
|
|
|
b = list_first_entry(&c->btree_cache_freed,
|
|
|
|
struct btree, list);
|
|
|
|
list_del(&b->list);
|
|
|
|
cancel_delayed_work_sync(&b->work);
|
|
|
|
kfree(b);
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_unlock(&c->bucket_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
int bch_btree_cache_alloc(struct cache_set *c)
|
|
|
|
{
|
|
|
|
unsigned i;
|
|
|
|
|
|
|
|
for (i = 0; i < mca_reserve(c); i++)
|
2013-10-25 08:19:26 +08:00
|
|
|
if (!mca_bucket_alloc(c, &ZERO_KEY, GFP_KERNEL))
|
|
|
|
return -ENOMEM;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
list_splice_init(&c->btree_cache,
|
|
|
|
&c->btree_cache_freeable);
|
|
|
|
|
|
|
|
#ifdef CONFIG_BCACHE_DEBUG
|
|
|
|
mutex_init(&c->verify_lock);
|
|
|
|
|
2013-12-18 14:49:08 +08:00
|
|
|
c->verify_ondisk = (void *)
|
|
|
|
__get_free_pages(GFP_KERNEL, ilog2(bucket_pages(c)));
|
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
c->verify_data = mca_bucket_alloc(c, &ZERO_KEY, GFP_KERNEL);
|
|
|
|
|
|
|
|
if (c->verify_data &&
|
2013-12-21 09:28:16 +08:00
|
|
|
c->verify_data->keys.set->data)
|
2013-03-24 07:11:31 +08:00
|
|
|
list_del_init(&c->verify_data->list);
|
|
|
|
else
|
|
|
|
c->verify_data = NULL;
|
|
|
|
#endif
|
|
|
|
|
2013-08-28 08:18:11 +08:00
	c->shrink.count_objects = bch_mca_count;
	c->shrink.scan_objects = bch_mca_scan;
	c->shrink.seeks = 4;
	c->shrink.batch = c->btree_pages * 2;

	if (register_shrinker(&c->shrink))
		pr_warn("bcache: %s: could not register shrinker",
				__func__);

	return 0;
}

/* Btree in memory cache - hash table */

static struct hlist_head *mca_hash(struct cache_set *c, struct bkey *k)
{
	return &c->bucket_hash[hash_32(PTR_HASH(c, k), BUCKET_HASH_BITS)];
}

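As a rough userspace illustration of how mca_hash() turns a key hash into a
table slot, the sketch below folds a 32-bit value into an index with a
multiplicative hash. The multiplier is assumed to match the 32-bit golden-ratio
constant behind the kernel's hash_32(); the names `demo_hash_32`,
`demo_bucket_index` and `DEMO_HASH_BITS`, and the table width of 8 bits, are
made up for this example and are not bcache's values.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplier assumed to match GOLDEN_RATIO_32 in include/linux/hash.h. */
#define DEMO_GOLDEN_RATIO_32	0x61C88647u
/* Hypothetical table width; bcache's BUCKET_HASH_BITS may differ. */
#define DEMO_HASH_BITS		8u

/* Multiply, then keep only the top `bits` bits of the 32-bit product. */
static uint32_t demo_hash_32(uint32_t val, unsigned int bits)
{
	assert(bits > 0 && bits <= 32);
	return (uint32_t)(val * DEMO_GOLDEN_RATIO_32) >> (32 - bits);
}

/* Index into a table of (1 << DEMO_HASH_BITS) hash chains. */
static uint32_t demo_bucket_index(uint32_t key_hash)
{
	return demo_hash_32(key_hash, DEMO_HASH_BITS);
}
```

Every result is below `1 << DEMO_HASH_BITS`, so it can index the chain array
directly, which is what the `c->bucket_hash[...]` lookup relies on.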
static struct btree *mca_find(struct cache_set *c, struct bkey *k)
{
	struct btree *b;

	rcu_read_lock();
	hlist_for_each_entry_rcu(b, mca_hash(c, k), hash)
		if (PTR_HASH(c, &b->key) == PTR_HASH(c, k))
			goto out;
	b = NULL;
out:
	rcu_read_unlock();
	return b;
}

static int mca_cannibalize_lock(struct cache_set *c, struct btree_op *op)
{
	struct task_struct *old;

	old = cmpxchg(&c->btree_cache_alloc_lock, NULL, current);
	if (old && old != current) {
		if (op)
			prepare_to_wait(&c->btree_cache_wait, &op->wait,
					TASK_UNINTERRUPTIBLE);
		return -EINTR;
	}

	return 0;
}

static struct btree *mca_cannibalize(struct cache_set *c, struct btree_op *op,
				     struct bkey *k)
{
	struct btree *b;

	trace_bcache_btree_cache_cannibalize(c);

	if (mca_cannibalize_lock(c, op))
		return ERR_PTR(-EINTR);

	list_for_each_entry_reverse(b, &c->btree_cache, list)
		if (!mca_reap(b, btree_order(k), false))
			return b;

	list_for_each_entry_reverse(b, &c->btree_cache, list)
		if (!mca_reap(b, btree_order(k), true))
			return b;

	WARN(1, "btree cache cannibalize failed\n");
	return ERR_PTR(-ENOMEM);
}

/*
 * We can only have one thread cannibalizing other cached btree nodes at a time,
 * or we'll deadlock. We use an open coded mutex to ensure that, which
 * mca_cannibalize_lock() takes. This means every time we unlock the root of
 * the btree, we need to release this lock if we have it held.
 */
static void bch_cannibalize_unlock(struct cache_set *c)
{
	if (c->btree_cache_alloc_lock == current) {
		c->btree_cache_alloc_lock = NULL;
		wake_up(&c->btree_cache_wait);
	}
}

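The open coded mutex described above boils down to an "owner word" that is
claimed with compare-and-swap. The following is a minimal userspace sketch of
that idea using C11 atomics, not the kernel code: the type and function names
(`demo_owner_lock`, `demo_lock_try`, `demo_lock_release`) and the use of small
integer ids in place of `current` are assumptions for illustration, and it
leaves out the wait queue entirely. Unlike bch_cannibalize_unlock(), the
release here also uses compare-and-swap.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the c->btree_cache_alloc_lock owner word. */
struct demo_owner_lock {
	atomic_int owner;	/* 0 = free, otherwise the holder's id */
};

/*
 * Returns 0 if `self` now holds the lock (or already held it),
 * -1 if another owner holds it.
 */
static int demo_lock_try(struct demo_owner_lock *l, int self)
{
	int expected = 0;

	assert(self != 0);
	if (atomic_compare_exchange_strong(&l->owner, &expected, self))
		return 0;		/* was free, now ours */
	return expected == self ? 0 : -1;	/* re-entry ok, contention fails */
}

/* Releases only if `self` is the holder; returns 1 if it released. */
static int demo_lock_release(struct demo_owner_lock *l, int self)
{
	int expected = self;

	return atomic_compare_exchange_strong(&l->owner, &expected, 0) ? 1 : 0;
}
```

Taking the lock again from the same owner succeeds, which mirrors the
`old != current` test in mca_cannibalize_lock(); a release attempted by a
non-owner is a no-op, mirroring the `== current` check in the unlock path.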
static struct btree *mca_alloc(struct cache_set *c, struct btree_op *op,
			       struct bkey *k, int level)
{
	struct btree *b;

	BUG_ON(current->bio_list);

	lockdep_assert_held(&c->bucket_lock);

	if (mca_find(c, k))
		return NULL;

	/* btree_free() doesn't free memory; it sticks the node on the end of
	 * the list. Check if there are any freed nodes there:
	 */
	list_for_each_entry(b, &c->btree_cache_freeable, list)
		if (!mca_reap(b, btree_order(k), false))
			goto out;

	/* We never free struct btree itself, just the memory that holds the on
	 * disk node. Check the freed list before allocating a new one:
	 */
	list_for_each_entry(b, &c->btree_cache_freed, list)
		if (!mca_reap(b, 0, false)) {
			mca_data_alloc(b, k, __GFP_NOWARN|GFP_NOIO);
			if (!b->keys.set[0].data)
				goto err;
			else
				goto out;
		}

	b = mca_bucket_alloc(c, k, __GFP_NOWARN|GFP_NOIO);
	if (!b)
		goto err;

	BUG_ON(!down_write_trylock(&b->lock));
	if (!b->keys.set->data)
		goto err;
out:
	BUG_ON(b->io_mutex.count != 1);

	bkey_copy(&b->key, k);
	list_move(&b->list, &c->btree_cache);
	hlist_del_init_rcu(&b->hash);
	hlist_add_head_rcu(&b->hash, mca_hash(c, k));

	lock_set_subclass(&b->lock.dep_map, level + 1, _THIS_IP_);
	b->parent = (void *) ~0UL;
	b->flags = 0;
	b->written = 0;
	b->level = level;

	if (!b->level)
		bch_btree_keys_init(&b->keys, &bch_extent_keys_ops,
				    &b->c->expensive_debug_checks);
	else
		bch_btree_keys_init(&b->keys, &bch_btree_keys_ops,
				    &b->c->expensive_debug_checks);

	return b;
err:
	if (b)
		rw_unlock(true, b);

	b = mca_cannibalize(c, op, k);
	if (!IS_ERR(b))
		goto out;

	return b;
}

/**
 * bch_btree_node_get - find a btree node in the cache and lock it, reading it
 * in from disk if necessary.
 *
 * If IO is necessary and running under generic_make_request, returns -EAGAIN.
 *
 * The btree node will have either a read or a write lock held, depending on
 * @write.
 */
struct btree *bch_btree_node_get(struct cache_set *c, struct btree_op *op,
				 struct bkey *k, int level, bool write,
				 struct btree *parent)
{
	int i = 0;
	struct btree *b;

	BUG_ON(level < 0);
retry:
	b = mca_find(c, k);

	if (!b) {
		if (current->bio_list)
			return ERR_PTR(-EAGAIN);

		mutex_lock(&c->bucket_lock);
		b = mca_alloc(c, op, k, level);
		mutex_unlock(&c->bucket_lock);

		if (!b)
			goto retry;
		if (IS_ERR(b))
			return b;

		bch_btree_node_read(b);

		if (!write)
			downgrade_write(&b->lock);
	} else {
		rw_lock(write, b, level);
		if (PTR_HASH(c, &b->key) != PTR_HASH(c, k)) {
			rw_unlock(write, b);
			goto retry;
		}
		BUG_ON(b->level != level);
	}

	b->parent = parent;
	b->accessed = 1;

	for (; i <= b->keys.nsets && b->keys.set[i].size; i++) {
		prefetch(b->keys.set[i].tree);
		prefetch(b->keys.set[i].data);
	}

	for (; i <= b->keys.nsets; i++)
		prefetch(b->keys.set[i].data);

	if (btree_node_io_error(b)) {
		rw_unlock(write, b);
		return ERR_PTR(-EIO);
	}

	BUG_ON(!b->written);

	return b;
}

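bch_btree_node_get() and mca_alloc() report failure through the returned
pointer itself (ERR_PTR(-EAGAIN), ERR_PTR(-EIO), or NULL for "raced, retry").
The sketch below shows, in plain userspace C, how an errno value can be carried
in a pointer and told apart from a valid address. The `demo_` helpers are
hypothetical stand-ins modelled on the kernel's ERR_PTR()/PTR_ERR()/IS_ERR()
family, and the 4095 limit is assumed to match the kernel's MAX_ERRNO.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Largest errno that may be encoded; assumed equal to the kernel's MAX_ERRNO. */
#define DEMO_MAX_ERRNO	4095

/* Encode a negative errno as a pointer in the top 4095 addresses. */
static void *demo_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Recover the negative errno from an encoded pointer. */
static long demo_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True if the pointer lies in the reserved error range. */
static int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

/* True for NULL as well as for encoded errors. */
static int demo_is_err_or_null(const void *p)
{
	return !p || demo_is_err(p);
}
```

This is why callers in this file test `IS_ERR(b)` and `!b` separately: NULL is
not an encoded error, so only the `_OR_NULL` variant covers both cases.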
static void btree_node_prefetch(struct btree *parent, struct bkey *k)
{
	struct btree *b;

	mutex_lock(&parent->c->bucket_lock);
	b = mca_alloc(parent->c, NULL, k, parent->level - 1);
	mutex_unlock(&parent->c->bucket_lock);

	if (!IS_ERR_OR_NULL(b)) {
		b->parent = parent;
		bch_btree_node_read(b);
		rw_unlock(true, b);
	}
}

/* Btree alloc */

static void btree_node_free(struct btree *b)
{
	trace_bcache_btree_node_free(b);

	BUG_ON(b == b->c->root);

	mutex_lock(&b->write_lock);

	if (btree_node_dirty(b))
		btree_complete_write(b, btree_current_write(b));
	clear_bit(BTREE_NODE_dirty, &b->flags);

	mutex_unlock(&b->write_lock);

	cancel_delayed_work(&b->work);

	mutex_lock(&b->c->bucket_lock);
	bch_bucket_free(b->c, &b->key);
	mca_bucket_free(b);
	mutex_unlock(&b->c->bucket_lock);
}

struct btree *__bch_btree_node_alloc(struct cache_set *c, struct btree_op *op,
				     int level, bool wait,
				     struct btree *parent)
{
	BKEY_PADDED(key) k;
	struct btree *b = ERR_PTR(-EAGAIN);

	mutex_lock(&c->bucket_lock);
retry:
	if (__bch_bucket_alloc_set(c, RESERVE_BTREE, &k.key, 1, wait))
		goto err;

	bkey_put(c, &k.key);
	SET_KEY_SIZE(&k.key, c->btree_pages * PAGE_SECTORS);

	b = mca_alloc(c, op, &k.key, level);
	if (IS_ERR(b))
		goto err_free;

	if (!b) {
		cache_bug(c,
			"Tried to allocate bucket that was in btree cache");
		goto retry;
	}

	b->accessed = 1;
	b->parent = parent;
	bch_bset_init_next(&b->keys, b->keys.set->data, bset_magic(&b->c->sb));

	mutex_unlock(&c->bucket_lock);

	trace_bcache_btree_node_alloc(b);
	return b;
err_free:
	bch_bucket_free(c, &k.key);
err:
	mutex_unlock(&c->bucket_lock);

	trace_bcache_btree_node_alloc_fail(c);
	return b;
}

static struct btree *bch_btree_node_alloc(struct cache_set *c,
					  struct btree_op *op, int level,
					  struct btree *parent)
{
	return __bch_btree_node_alloc(c, op, level, op != NULL, parent);
}

static struct btree *btree_node_alloc_replacement(struct btree *b,
						  struct btree_op *op)
{
	struct btree *n = bch_btree_node_alloc(b->c, op, b->level, b->parent);
	if (!IS_ERR_OR_NULL(n)) {
		mutex_lock(&n->write_lock);
		bch_btree_sort_into(&b->keys, &n->keys, &b->c->sort);
		bkey_copy_key(&n->key, &b->key);
		mutex_unlock(&n->write_lock);
	}

	return n;
}

static void make_btree_freeing_key(struct btree *b, struct bkey *k)
{
	unsigned i;

	mutex_lock(&b->c->bucket_lock);

	atomic_inc(&b->c->prio_blocked);

	bkey_copy(k, &b->key);
	bkey_copy_key(k, &ZERO_KEY);

	for (i = 0; i < KEY_PTRS(k); i++)
		SET_PTR_GEN(k, i,
			    bch_inc_gen(PTR_CACHE(b->c, &b->key, i),
					PTR_BUCKET(b->c, &b->key, i)));

	mutex_unlock(&b->c->bucket_lock);
}

static int btree_check_reserve(struct btree *b, struct btree_op *op)
{
	struct cache_set *c = b->c;
	struct cache *ca;
	unsigned i, reserve = (c->root->level - b->level) * 2 + 1;

	mutex_lock(&c->bucket_lock);

	for_each_cache(ca, c, i)
		if (fifo_used(&ca->free[RESERVE_BTREE]) < reserve) {
			if (op)
				prepare_to_wait(&c->btree_cache_wait, &op->wait,
						TASK_UNINTERRUPTIBLE);
			mutex_unlock(&c->bucket_lock);
			return -EINTR;
		}

	mutex_unlock(&c->bucket_lock);

	return mca_cannibalize_lock(b->c, op);
}

/* Garbage collection */

static uint8_t __bch_btree_mark_key(struct cache_set *c, int level,
				    struct bkey *k)
{
	uint8_t stale = 0;
	unsigned i;
	struct bucket *g;

	/*
	 * ptr_invalid() can't return true for the keys that mark btree nodes as
	 * freed, but since ptr_bad() returns true we'll never actually use them
	 * for anything and thus we don't want to mark their pointers here
	 */
	if (!bkey_cmp(k, &ZERO_KEY))
		return stale;

	for (i = 0; i < KEY_PTRS(k); i++) {
		if (!ptr_available(c, k, i))
			continue;

		g = PTR_BUCKET(c, k, i);

		if (gen_after(g->last_gc, PTR_GEN(k, i)))
			g->last_gc = PTR_GEN(k, i);

		if (ptr_stale(c, k, i)) {
			stale = max(stale, ptr_stale(c, k, i));
			continue;
		}

		cache_bug_on(GC_MARK(g) &&
			     (GC_MARK(g) == GC_MARK_METADATA) != (level != 0),
			     c, "inconsistent ptrs: mark = %llu, level = %i",
			     GC_MARK(g), level);

		if (level)
			SET_GC_MARK(g, GC_MARK_METADATA);
		else if (KEY_DIRTY(k))
			SET_GC_MARK(g, GC_MARK_DIRTY);
		else if (!GC_MARK(g))
			SET_GC_MARK(g, GC_MARK_RECLAIMABLE);

		/* guard against overflow */
		SET_GC_SECTORS_USED(g, min_t(unsigned,
					     GC_SECTORS_USED(g) + KEY_SIZE(k),
bcache: fix BUG_ON due to integer overflow with GC_SECTORS_USED
The BUG_ON at the end of __bch_btree_mark_key can be triggered due to
an integer overflow error:
BITMASK(GC_SECTORS_USED, struct bucket, gc_mark, 2, 13);
...
SET_GC_SECTORS_USED(g, min_t(unsigned,
GC_SECTORS_USED(g) + KEY_SIZE(k),
(1 << 14) - 1));
BUG_ON(!GC_SECTORS_USED(g));
In bcache.h, the SECTORS_USED bitfield is defined to be 13 bits wide.
While the SET_ code tries to ensure that the field doesn't overflow by
clamping it to (1<<14)-1 == 16383, this is incorrect because 16383
requires 14 bits. Therefore, if GC_SECTORS_USED() + KEY_SIZE() =
8192, the SET_ statement tries to store 8192 into a 13-bit field. In
a 13-bit field, 8192 becomes zero, thus triggering the BUG_ON.
Therefore, create a field width constant and a max value constant, and
use those to create the bitfield and check the inputs to
SET_GC_SECTORS_USED. Arguably the BITMASK() template ought to have
BUG_ON checks for too-large values, but that's a separate patch.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2014-01-29 08:57:39 +08:00
					     MAX_GC_SECTORS_USED));

		BUG_ON(!GC_SECTORS_USED(g));
	}

	return stale;
}

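The overflow that the commit message quoted inside __bch_btree_mark_key()
describes can be reproduced with ordinary integer arithmetic. The sketch below
models a 13-bit field with a mask and compares clamping to the old limit,
(1 << 14) - 1, with clamping to the field's real maximum. All names are
hypothetical, and the assumption that MAX_GC_SECTORS_USED equals
(1 << 13) - 1 is inferred from the commit message rather than shown in this
file.

```c
#include <assert.h>

/* Field width taken from the commit message (13 bits). */
#define DEMO_SECTORS_BITS	13u
/* Assumed value of MAX_GC_SECTORS_USED: the largest 13-bit number, 8191. */
#define DEMO_SECTORS_MAX	((1u << DEMO_SECTORS_BITS) - 1)

/* What a 13-bit bitfield keeps when a wider value is stored into it. */
static unsigned int demo_field_store(unsigned int v)
{
	return v & DEMO_SECTORS_MAX;
}

/* Add, clamp to `limit`, then store into the 13-bit field. */
static unsigned int demo_add_clamped(unsigned int used, unsigned int size,
				     unsigned int limit)
{
	unsigned int sum = used + size;

	return demo_field_store(sum < limit ? sum : limit);
}
```

With the old limit, 8000 + 192 = 8192 passes the clamp and is truncated to 0,
which is the condition that trips `BUG_ON(!GC_SECTORS_USED(g))`; with the
13-bit maximum the same inputs saturate at 8191.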
#define btree_mark_key(b, k)	__bch_btree_mark_key(b->c, b->level, k)

void bch_initial_mark_key(struct cache_set *c, int level, struct bkey *k)
{
	unsigned i;

	for (i = 0; i < KEY_PTRS(k); i++)
		if (ptr_available(c, k, i) &&
		    !ptr_stale(c, k, i)) {
			struct bucket *b = PTR_BUCKET(c, k, i);

			b->gen = PTR_GEN(k, i);

			if (level && bkey_cmp(k, &ZERO_KEY))
				b->prio = BTREE_PRIO;
			else if (!level && b->prio == BTREE_PRIO)
				b->prio = INITIAL_PRIO;
		}

	__bch_btree_mark_key(c, level, k);
}

void bch_update_bucket_in_use(struct cache_set *c, struct gc_stat *stats)
{
	stats->in_use = (c->nbuckets - c->avail_nbuckets) * 100 / c->nbuckets;
}

static bool btree_gc_mark_node(struct btree *b, struct gc_stat *gc)
{
	uint8_t stale = 0;
	unsigned keys = 0, good_keys = 0;
	struct bkey *k;
	struct btree_iter iter;
	struct bset_tree *t;

	gc->nodes++;

	for_each_key_filter(&b->keys, k, &iter, bch_ptr_invalid) {
		stale = max(stale, btree_mark_key(b, k));
		keys++;

		if (bch_ptr_bad(&b->keys, k))
			continue;

		gc->key_bytes += bkey_u64s(k);
		gc->nkeys++;
		good_keys++;

		gc->data += KEY_SIZE(k);
	}

	for (t = b->keys.set; t <= &b->keys.set[b->keys.nsets]; t++)
		btree_bug_on(t->size &&
			     bset_written(&b->keys, t) &&
			     bkey_cmp(&b->key, &t->end) < 0,
			     b, "found short btree key in gc");

	if (b->c->gc_always_rewrite)
		return true;

	if (stale > 10)
		return true;

	if ((keys - good_keys) * 2 > keys)
		return true;

	return false;
}

#define GC_MERGE_NODES	4U

struct gc_merge_info {
	struct btree *b;
	unsigned keys;
};

static int bch_btree_insert_node(struct btree *, struct btree_op *,
				 struct keylist *, atomic_t *, struct bkey *);

static int btree_gc_coalesce(struct btree *b, struct btree_op *op,
			     struct gc_stat *gc, struct gc_merge_info *r)
{
	unsigned i, nodes = 0, keys = 0, blocks;
	struct btree *new_nodes[GC_MERGE_NODES];
	struct keylist keylist;
	struct closure cl;
	struct bkey *k;

	bch_keylist_init(&keylist);

	if (btree_check_reserve(b, NULL))
		return 0;

	memset(new_nodes, 0, sizeof(new_nodes));
	closure_init_stack(&cl);

	while (nodes < GC_MERGE_NODES && !IS_ERR_OR_NULL(r[nodes].b))
		keys += r[nodes++].keys;

	blocks = btree_default_blocks(b->c) * 2 / 3;

	if (nodes < 2 ||
	    __set_blocks(b->keys.set[0].data, keys,
			 block_bytes(b->c)) > blocks * (nodes - 1))
		return 0;

	for (i = 0; i < nodes; i++) {
		new_nodes[i] = btree_node_alloc_replacement(r[i].b, NULL);
		if (IS_ERR_OR_NULL(new_nodes[i]))
			goto out_nocoalesce;
	}

	/*
	 * We have to check the reserve here, after we've allocated our new
	 * nodes, to make sure the insert below will succeed - we also check
	 * before as an optimization to potentially avoid a bunch of expensive
	 * allocs/sorts
	 */
	if (btree_check_reserve(b, NULL))
		goto out_nocoalesce;

	for (i = 0; i < nodes; i++)
		mutex_lock(&new_nodes[i]->write_lock);

	for (i = nodes - 1; i > 0; --i) {
		struct bset *n1 = btree_bset_first(new_nodes[i]);
		struct bset *n2 = btree_bset_first(new_nodes[i - 1]);
		struct bkey *k, *last = NULL;

		keys = 0;

		if (i > 1) {
			for (k = n2->start;
			     k < bset_bkey_last(n2);
			     k = bkey_next(k)) {
				if (__set_blocks(n1, n1->keys + keys +
						 bkey_u64s(k),
						 block_bytes(b->c)) > blocks)
					break;

				last = k;
				keys += bkey_u64s(k);
			}
		} else {
			/*
			 * Last node we're not getting rid of - we're getting
			 * rid of the node at r[0]. Have to try and fit all of
			 * the remaining keys into this node; we can't ensure
			 * they will always fit due to rounding and variable
			 * length keys (shouldn't be possible in practice,
			 * though)
			 */
			if (__set_blocks(n1, n1->keys + n2->keys,
					 block_bytes(b->c)) >
			    btree_blocks(new_nodes[i]))
				goto out_nocoalesce;

			keys = n2->keys;
			/* Take the key of the node we're getting rid of */
			last = &r->b->key;
		}

		BUG_ON(__set_blocks(n1, n1->keys + keys, block_bytes(b->c)) >
		       btree_blocks(new_nodes[i]));

		if (last)
			bkey_copy_key(&new_nodes[i]->key, last);

		memcpy(bset_bkey_last(n1),
		       n2->start,
		       (void *) bset_bkey_idx(n2, keys) - (void *) n2->start);

		n1->keys += keys;
		r[i].keys = n1->keys;

		memmove(n2->start,
			bset_bkey_idx(n2, keys),
			(void *) bset_bkey_last(n2) -
			(void *) bset_bkey_idx(n2, keys));

		n2->keys -= keys;

		if (__bch_keylist_realloc(&keylist,
					  bkey_u64s(&new_nodes[i]->key)))
			goto out_nocoalesce;

		bch_btree_node_write(new_nodes[i], &cl);
		bch_keylist_add(&keylist, &new_nodes[i]->key);
	}

	for (i = 0; i < nodes; i++)
		mutex_unlock(&new_nodes[i]->write_lock);

	closure_sync(&cl);

	/* We emptied out this node */
	BUG_ON(btree_bset_first(new_nodes[0])->keys);
	btree_node_free(new_nodes[0]);
	rw_unlock(true, new_nodes[0]);
	new_nodes[0] = NULL;

	for (i = 0; i < nodes; i++) {
		if (__bch_keylist_realloc(&keylist, bkey_u64s(&r[i].b->key)))
			goto out_nocoalesce;

		make_btree_freeing_key(r[i].b, keylist.top);
		bch_keylist_push(&keylist);
	}

	bch_btree_insert_node(b, op, &keylist, NULL, NULL);
	BUG_ON(!bch_keylist_empty(&keylist));

	for (i = 0; i < nodes; i++) {
		btree_node_free(r[i].b);
		rw_unlock(true, r[i].b);

		r[i].b = new_nodes[i];
	}

	memmove(r, r + 1, sizeof(r[0]) * (nodes - 1));
	r[nodes - 1].b = ERR_PTR(-EINTR);

	trace_bcache_btree_gc_coalesce(nodes);
	gc->nodes--;

	bch_keylist_free(&keylist);

	/* Invalidated our iterator */
	return -EINTR;

out_nocoalesce:
	closure_sync(&cl);

	/* Drain the keylist before freeing it: bch_keylist_pop() reads its buffer */
	while ((k = bch_keylist_pop(&keylist)))
		if (!bkey_cmp(k, &ZERO_KEY))
			atomic_dec(&b->c->prio_blocked);
	bch_keylist_free(&keylist);

	for (i = 0; i < nodes; i++)
		if (!IS_ERR_OR_NULL(new_nodes[i])) {
			btree_node_free(new_nodes[i]);
			rw_unlock(true, new_nodes[i]);
		}
	return 0;
}

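The memcpy()/memmove() pair in the coalesce loop above moves the first `keys`
u64s of one bset onto the tail of its neighbour and then closes the gap. A
minimal sketch of that data movement on a toy structure follows; `demo_set`,
`DEMO_CAP` and `demo_move_front` are hypothetical names, and a real bset also
carries variable-length keys and block-size checks that are left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_CAP 16	/* hypothetical fixed capacity, in u64s */

/* Toy stand-in for a bset: a count of u64s plus the u64s themselves. */
struct demo_set {
	size_t keys;
	uint64_t d[DEMO_CAP];
};

/* Move the first `keys` u64s of n2 to the end of n1, then compact n2. */
static void demo_move_front(struct demo_set *n1, struct demo_set *n2,
			    size_t keys)
{
	assert(keys <= n2->keys);
	assert(n1->keys + keys <= DEMO_CAP);

	/* Source and destination are different sets, so memcpy() is fine. */
	memcpy(n1->d + n1->keys, n2->d, keys * sizeof(n2->d[0]));
	n1->keys += keys;

	/* The ranges overlap within n2, so this has to be memmove(). */
	memmove(n2->d, n2->d + keys, (n2->keys - keys) * sizeof(n2->d[0]));
	n2->keys -= keys;
}
```

Moving all of n2 leaves it empty, which corresponds to the
`BUG_ON(btree_bset_first(new_nodes[0])->keys)` check after the loop.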
static int btree_gc_rewrite_node(struct btree *b, struct btree_op *op,
				 struct btree *replace)
{
	struct keylist keys;
	struct btree *n;

	if (btree_check_reserve(b, NULL))
		return 0;

	n = btree_node_alloc_replacement(replace, NULL);

	/* recheck reserve after allocating replacement node */
	if (btree_check_reserve(b, NULL)) {
		btree_node_free(n);
		rw_unlock(true, n);
		return 0;
	}

	bch_btree_node_write_sync(n);

	bch_keylist_init(&keys);
	bch_keylist_add(&keys, &n->key);

	make_btree_freeing_key(replace, keys.top);
	bch_keylist_push(&keys);

	bch_btree_insert_node(b, op, &keys, NULL, NULL);
	BUG_ON(!bch_keylist_empty(&keys));

	btree_node_free(replace);
	rw_unlock(true, n);

	/* Invalidated our iterator */
	return -EINTR;
}

static unsigned btree_gc_count_keys(struct btree *b)
{
	struct bkey *k;
	struct btree_iter iter;
	unsigned ret = 0;

	for_each_key_filter(&b->keys, k, &iter, bch_ptr_bad)
		ret += bkey_u64s(k);

	return ret;
}

static int btree_gc_recurse(struct btree *b, struct btree_op *op,
|
|
|
|
struct closure *writes, struct gc_stat *gc)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
bool should_rewrite;
|
|
|
|
struct bkey *k;
|
|
|
|
struct btree_iter iter;
|
2013-03-24 07:11:31 +08:00
|
|
|
struct gc_merge_info r[GC_MERGE_NODES];
|
2014-03-05 08:42:42 +08:00
|
|
|
struct gc_merge_info *i, *last = r + ARRAY_SIZE(r) - 1;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-11-12 09:35:24 +08:00
|
|
|
bch_btree_iter_init(&b->keys, &iter, &b->c->gc_done);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2014-03-05 08:42:42 +08:00
|
|
|
for (i = r; i < r + ARRAY_SIZE(r); i++)
|
|
|
|
i->b = ERR_PTR(-EINTR);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
while (1) {
|
2013-12-21 09:28:16 +08:00
|
|
|
k = bch_btree_iter_next_filter(&iter, &b->keys, bch_ptr_bad);
|
2013-09-11 10:07:00 +08:00
|
|
|
if (k) {
|
2014-03-18 08:15:53 +08:00
|
|
|
r->b = bch_btree_node_get(b->c, op, k, b->level - 1,
|
2014-07-12 15:22:53 +08:00
|
|
|
true, b);
|
2013-09-11 10:07:00 +08:00
|
|
|
if (IS_ERR(r->b)) {
|
|
|
|
ret = PTR_ERR(r->b);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
r->keys = btree_gc_count_keys(r->b);
|
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
ret = btree_gc_coalesce(b, op, gc, r);
|
2013-09-11 10:07:00 +08:00
|
|
|
if (ret)
|
|
|
|
break;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
if (!last->b)
|
|
|
|
break;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
if (!IS_ERR(last->b)) {
|
|
|
|
should_rewrite = btree_gc_mark_node(last->b, gc);
|
2014-03-18 08:15:53 +08:00
|
|
|
if (should_rewrite) {
|
|
|
|
ret = btree_gc_rewrite_node(b, op, last->b);
|
|
|
|
if (ret)
|
2013-09-11 10:07:00 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (last->b->level) {
|
|
|
|
ret = btree_gc_recurse(last->b, op, writes, gc);
|
|
|
|
if (ret)
|
|
|
|
break;
|
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
bkey_copy_key(&b->c->gc_done, &last->b->key);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Must flush leaf nodes before gc ends, since replace
|
|
|
|
* operations aren't journalled
|
|
|
|
*/
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_lock(&last->b->write_lock);
|
2013-09-11 10:07:00 +08:00
|
|
|
if (btree_node_dirty(last->b))
|
|
|
|
bch_btree_node_write(last->b, writes);
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_unlock(&last->b->write_lock);
|
2013-09-11 10:07:00 +08:00
|
|
|
rw_unlock(true, last->b);
|
|
|
|
}
|
|
|
|
|
|
|
|
memmove(r + 1, r, sizeof(r[0]) * (GC_MERGE_NODES - 1));
|
|
|
|
r->b = NULL;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
if (need_resched()) {
|
|
|
|
ret = -EAGAIN;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-03-05 08:42:42 +08:00
|
|
|
for (i = r; i < r + ARRAY_SIZE(r); i++)
|
|
|
|
if (!IS_ERR_OR_NULL(i->b)) {
|
|
|
|
mutex_lock(&i->b->write_lock);
|
|
|
|
if (btree_node_dirty(i->b))
|
|
|
|
bch_btree_node_write(i->b, writes);
|
|
|
|
mutex_unlock(&i->b->write_lock);
|
|
|
|
rw_unlock(true, i->b);
|
2013-09-11 10:07:00 +08:00
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
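
The loop in btree_gc_recurse() keeps the most recently visited child nodes in r[] and shifts the array by one slot per iteration, so that r[0] is free for the next child and the oldest entry reaches the last slot. Below is a minimal standalone sketch of that memmove-based shift, using plain ints in place of struct gc_merge_info. The names WINDOW and window_push are hypothetical and do not appear in bcache, and the sketch folds "shift" and "store the new value" into one call, which the real loop does in separate steps.

```c
#include <assert.h>
#include <string.h>

#define WINDOW 4	/* stands in for GC_MERGE_NODES; the value is an assumption */

/*
 * Shift every slot one position toward the end, dropping the last
 * element, then store the new value in slot 0. Returns the dropped
 * element so the caller can finish processing it.
 */
int window_push(int w[WINDOW], int value)
{
	int oldest = w[WINDOW - 1];

	assert(w != NULL);
	/* memmove rather than memcpy: source and destination overlap */
	memmove(w + 1, w, sizeof(w[0]) * (WINDOW - 1));
	w[0] = value;
	return oldest;
}
```

Calling window_push() on {1, 2, 3, 4} with a new value of 9 leaves {9, 1, 2, 3} in the array and hands back 4, which mirrors how the real loop finishes work on last->b before the slot is reused.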
|
|
|
|
|
|
|
|
static int bch_btree_gc_root(struct btree *b, struct btree_op *op,
|
|
|
|
struct closure *writes, struct gc_stat *gc)
|
|
|
|
{
|
|
|
|
struct btree *n = NULL;
|
2013-09-11 10:07:00 +08:00
|
|
|
int ret = 0;
|
|
|
|
bool should_rewrite;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
should_rewrite = btree_gc_mark_node(b, gc);
|
|
|
|
if (should_rewrite) {
|
2014-03-18 08:15:53 +08:00
|
|
|
n = btree_node_alloc_replacement(b, NULL);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
if (!IS_ERR_OR_NULL(n)) {
|
|
|
|
bch_btree_node_write_sync(n);
|
2014-03-05 08:42:42 +08:00
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
bch_btree_set_root(n);
|
|
|
|
btree_node_free(b);
|
|
|
|
rw_unlock(true, n);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
return -EINTR;
|
|
|
|
}
|
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2014-03-18 06:13:26 +08:00
|
|
|
__bch_btree_mark_key(b->c, b->level + 1, &b->key);
|
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
if (b->level) {
|
|
|
|
ret = btree_gc_recurse(b, op, writes, gc);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
bkey_copy_key(&b->c->gc_done, &b->key);
|
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void btree_gc_start(struct cache_set *c)
|
|
|
|
{
|
|
|
|
struct cache *ca;
|
|
|
|
struct bucket *b;
|
|
|
|
unsigned i;
|
|
|
|
|
|
|
|
if (!c->gc_mark_valid)
|
|
|
|
return;
|
|
|
|
|
|
|
|
mutex_lock(&c->bucket_lock);
|
|
|
|
|
|
|
|
c->gc_mark_valid = 0;
|
|
|
|
c->gc_done = ZERO_KEY;
|
|
|
|
|
|
|
|
for_each_cache(ca, c, i)
|
|
|
|
for_each_bucket(b, ca) {
|
2014-02-28 09:51:12 +08:00
|
|
|
b->last_gc = b->gen;
|
2013-07-12 10:43:21 +08:00
|
|
|
if (!atomic_read(&b->pin)) {
|
2014-03-14 04:46:29 +08:00
|
|
|
SET_GC_MARK(b, 0);
|
2013-07-12 10:43:21 +08:00
|
|
|
SET_GC_SECTORS_USED(b, 0);
|
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
mutex_unlock(&c->bucket_lock);
|
|
|
|
}
|
|
|
|
|
2017-10-31 05:46:33 +08:00
|
|
|
static void bch_btree_gc_finish(struct cache_set *c)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
struct bucket *b;
|
|
|
|
struct cache *ca;
|
|
|
|
unsigned i;
|
|
|
|
|
|
|
|
mutex_lock(&c->bucket_lock);
|
|
|
|
|
|
|
|
set_gc_sectors(c);
|
|
|
|
c->gc_mark_valid = 1;
|
|
|
|
c->need_gc = 0;
|
|
|
|
|
|
|
|
for (i = 0; i < KEY_PTRS(&c->uuid_bucket); i++)
|
|
|
|
SET_GC_MARK(PTR_BUCKET(c, &c->uuid_bucket, i),
|
|
|
|
GC_MARK_METADATA);
|
|
|
|
|
2013-11-27 11:14:23 +08:00
|
|
|
/* don't reclaim buckets to which writeback keys point */
|
|
|
|
rcu_read_lock();
|
2018-01-09 04:21:28 +08:00
|
|
|
for (i = 0; i < c->devices_max_used; i++) {
|
2013-11-27 11:14:23 +08:00
|
|
|
struct bcache_device *d = c->devices[i];
|
|
|
|
struct cached_dev *dc;
|
|
|
|
struct keybuf_key *w, *n;
|
|
|
|
unsigned j;
|
|
|
|
|
|
|
|
if (!d || UUID_FLASH_ONLY(&c->uuids[i]))
|
|
|
|
continue;
|
|
|
|
dc = container_of(d, struct cached_dev, disk);
|
|
|
|
|
|
|
|
spin_lock(&dc->writeback_keys.lock);
|
|
|
|
rbtree_postorder_for_each_entry_safe(w, n,
|
|
|
|
&dc->writeback_keys.keys, node)
|
|
|
|
for (j = 0; j < KEY_PTRS(&w->key); j++)
|
|
|
|
SET_GC_MARK(PTR_BUCKET(c, &w->key, j),
|
|
|
|
GC_MARK_DIRTY);
|
|
|
|
spin_unlock(&dc->writeback_keys.lock);
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
2017-10-31 05:46:33 +08:00
|
|
|
c->avail_nbuckets = 0;
|
2013-03-24 07:11:31 +08:00
|
|
|
for_each_cache(ca, c, i) {
|
|
|
|
uint64_t *i;
|
|
|
|
|
|
|
|
ca->invalidate_needs_gc = 0;
|
|
|
|
|
|
|
|
for (i = ca->sb.d; i < ca->sb.d + ca->sb.keys; i++)
|
|
|
|
SET_GC_MARK(ca->buckets + *i, GC_MARK_METADATA);
|
|
|
|
|
|
|
|
for (i = ca->prio_buckets;
|
|
|
|
i < ca->prio_buckets + prio_buckets(ca) * 2; i++)
|
|
|
|
SET_GC_MARK(ca->buckets + *i, GC_MARK_METADATA);
|
|
|
|
|
|
|
|
for_each_bucket(b, ca) {
|
|
|
|
c->need_gc = max(c->need_gc, bucket_gc_gen(b));
|
|
|
|
|
2014-03-14 04:46:29 +08:00
|
|
|
if (atomic_read(&b->pin))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
BUG_ON(!GC_MARK(b) && GC_SECTORS_USED(b));
|
|
|
|
|
|
|
|
if (!GC_MARK(b) || GC_MARK(b) == GC_MARK_RECLAIMABLE)
|
2017-10-31 05:46:33 +08:00
|
|
|
c->avail_nbuckets++;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_unlock(&c->bucket_lock);
|
|
|
|
}
|
|
|
|
|
2013-10-25 08:19:26 +08:00
|
|
|
static void bch_btree_gc(struct cache_set *c)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct gc_stat stats;
|
|
|
|
struct closure writes;
|
|
|
|
struct btree_op op;
|
|
|
|
uint64_t start_time = local_clock();
|
2013-04-26 04:58:35 +08:00
|
|
|
|
2013-04-27 06:39:55 +08:00
|
|
|
trace_bcache_gc_start(c);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
memset(&stats, 0, sizeof(struct gc_stat));
|
|
|
|
closure_init_stack(&writes);
|
2013-07-25 09:04:18 +08:00
|
|
|
bch_btree_op_init(&op, SHRT_MAX);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
btree_gc_start(c);
|
|
|
|
|
bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags
When too many I/Os failed on cache device, bch_cache_set_error() is called
in the error handling code path to retire whole problematic cache set. If
new I/O requests continue to come and take refcount dc->count, the cache
set won't be retired immediately, which is a problem.
Furthermore, several kernel threads and self-armed kernel work items
may still be running after bch_cache_set_error() is called. It can take
quite a while for them to stop, or they may not stop at all. They also
prevent the cache set from being retired.
The solution in this patch is, to add per cache set flag to disable I/O
request on this cache and all attached backing devices. Then new coming I/O
requests can be rejected in *_make_request() before taking refcount, kernel
threads and self-armed kernel worker can stop very fast when flags bit
CACHE_SET_IO_DISABLE is set.
Because bcache also does internal I/O for writeback, garbage collection,
bucket allocation, journaling, this kind of I/O should be disabled after
bch_cache_set_error() is called. So closure_bio_submit() is modified to
check whether CACHE_SET_IO_DISABLE is set on cache_set->flags. If set,
closure_bio_submit() will set bio->bi_status to BLK_STS_IOERR and
return, generic_make_request() won't be called.
A sysfs interface is also added to set or clear CACHE_SET_IO_DISABLE bit
from cache_set->flags, to disable or enable cache set I/O for debugging. It
is helpful to trigger more corner case issues for failed cache device.
Changelog
v4, add wait_for_kthread_stop(), and call it before exits writeback and gc
kernel threads.
v3, change CACHE_SET_IO_DISABLE from 4 to 3, since it is bit index.
remove "bcache: " prefix when printing out kernel message.
v2, more changes by previous review,
- Use CACHE_SET_IO_DISABLE of cache_set->flags, suggested by Junhui.
- Check CACHE_SET_IO_DISABLE in bch_btree_gc() to stop a while-loop, this
was reported in and inspired by the original patch from Pavel Vazharov.
v1, initial version.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Cc: Michael Lyle <mlyle@lyle.org>
Cc: Pavel Vazharov <freakpv@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-19 08:36:17 +08:00
|
|
|
/* if CACHE_SET_IO_DISABLE set, gc thread should stop too */
|
2013-09-11 10:07:00 +08:00
|
|
|
do {
|
|
|
|
ret = btree_root(gc_root, c, &op, &writes, &stats);
|
|
|
|
closure_sync(&writes);
|
2015-11-30 09:18:33 +08:00
|
|
|
cond_resched();
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
if (ret && ret != -EAGAIN)
|
|
|
|
pr_warn("gc failed!");
|
2018-03-19 08:36:17 +08:00
|
|
|
} while (ret && !test_bit(CACHE_SET_IO_DISABLE, &c->flags));
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2017-10-31 05:46:33 +08:00
|
|
|
bch_btree_gc_finish(c);
|
2013-04-26 04:58:35 +08:00
|
|
|
wake_up_allocators(c);
|
|
|
|
|
2013-03-29 02:50:55 +08:00
|
|
|
bch_time_stats_update(&c->btree_gc_time, start_time);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
stats.key_bytes *= sizeof(uint64_t);
|
|
|
|
stats.data <<= 9;
|
2017-10-31 05:46:33 +08:00
|
|
|
bch_update_bucket_in_use(c, &stats);
|
2013-03-24 07:11:31 +08:00
|
|
|
memcpy(&c->gc_stats, &stats, sizeof(struct gc_stat));
|
|
|
|
|
2013-04-27 06:39:55 +08:00
|
|
|
trace_bcache_gc_end(c);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-10-25 08:19:26 +08:00
|
|
|
bch_moving_gc(c);
|
|
|
|
}
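
The do/while loop in bch_btree_gc() retries the GC pass for as long as it returns an error, and the CACHE_SET_IO_DISABLE check is what lets it give up once the cache set is being retired. The following is a user-space sketch of that control flow only, assuming C11 atomics in place of the kernel's test_bit(). All toy_* names and the struct layout are hypothetical. The bit index 3 is taken from the v3 changelog note above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define IO_DISABLE_BIT 3	/* bit index, per the v3 changelog note */

struct toy_set {
	atomic_ulong flags;	/* stands in for cache_set->flags */
	int fail_passes;	/* number of passes that fail before one succeeds */
	int disable_after;	/* set the flag on this pass number; < 0 means never */
	int passes;		/* passes attempted so far */
};

bool toy_test_bit(unsigned int bit, atomic_ulong *flags)
{
	return (atomic_load(flags) >> bit) & 1UL;
}

/* One GC pass: nonzero means "retry", like -EAGAIN in the real code. */
int toy_gc_pass(struct toy_set *c)
{
	c->passes++;
	if (c->disable_after >= 0 && c->passes >= c->disable_after)
		atomic_fetch_or(&c->flags, 1UL << IO_DISABLE_BIT);
	return c->passes <= c->fail_passes ? -1 : 0;
}

/* Retry until a pass succeeds or the disable bit is observed. */
int toy_gc(struct toy_set *c)
{
	int ret;

	do {
		ret = toy_gc_pass(c);
	} while (ret && !toy_test_bit(IO_DISABLE_BIT, &c->flags));

	return ret;
}
```

Without the flag test in the loop condition, a pass that never succeeds would keep the loop spinning indefinitely. With it, the loop exits with the last error as soon as the bit is seen.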
|
|
|
|
|
2016-10-27 11:31:17 +08:00
|
|
|
static bool gc_should_run(struct cache_set *c)
|
2013-10-25 08:19:26 +08:00
|
|
|
{
|
2013-09-11 10:07:00 +08:00
|
|
|
struct cache *ca;
|
|
|
|
unsigned i;
|
2013-10-25 08:19:26 +08:00
|
|
|
|
2016-10-27 11:31:17 +08:00
|
|
|
for_each_cache(ca, c, i)
|
|
|
|
if (ca->invalidate_needs_gc)
|
|
|
|
return true;
|
2013-10-25 08:19:26 +08:00
|
|
|
|
2016-10-27 11:31:17 +08:00
|
|
|
if (atomic_read(&c->sectors_to_gc) < 0)
|
|
|
|
return true;
|
2013-10-25 08:19:26 +08:00
|
|
|
|
2016-10-27 11:31:17 +08:00
|
|
|
return false;
|
|
|
|
}
|
2013-09-11 10:07:00 +08:00
|
|
|
|
2016-10-27 11:31:17 +08:00
|
|
|
static int bch_gc_thread(void *arg)
|
|
|
|
{
|
|
|
|
struct cache_set *c = arg;
|
2013-09-11 10:07:00 +08:00
|
|
|
|
2016-10-27 11:31:17 +08:00
|
|
|
while (1) {
|
|
|
|
wait_event_interruptible(c->gc_wait,
|
2018-03-19 08:36:17 +08:00
|
|
|
kthread_should_stop() ||
|
|
|
|
test_bit(CACHE_SET_IO_DISABLE, &c->flags) ||
|
|
|
|
gc_should_run(c));
|
2013-09-11 10:07:00 +08:00
|
|
|
|
2018-03-19 08:36:17 +08:00
|
|
|
if (kthread_should_stop() ||
|
|
|
|
test_bit(CACHE_SET_IO_DISABLE, &c->flags))
|
2016-10-27 11:31:17 +08:00
|
|
|
break;
|
|
|
|
|
|
|
|
set_gc_sectors(c);
|
|
|
|
bch_btree_gc(c);
|
2013-10-25 08:19:26 +08:00
|
|
|
}
|
|
|
|
|
2018-03-19 08:36:17 +08:00
|
|
|
wait_for_kthread_stop();
|
2013-10-25 08:19:26 +08:00
|
|
|
return 0;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2013-10-25 08:19:26 +08:00
|
|
|
int bch_gc_thread_start(struct cache_set *c)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2016-10-27 11:31:17 +08:00
|
|
|
c->gc_thread = kthread_run(bch_gc_thread, c, "bcache_gc");
|
2018-01-09 04:21:20 +08:00
|
|
|
return PTR_ERR_OR_ZERO(c->gc_thread);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Initial partial gc */
|
|
|
|
|
2014-03-18 06:13:26 +08:00
|
|
|
static int bch_btree_check_recurse(struct btree *b, struct btree_op *op)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2013-09-11 08:18:59 +08:00
|
|
|
int ret = 0;
|
|
|
|
struct bkey *k, *p = NULL;
|
2013-03-24 07:11:31 +08:00
|
|
|
struct btree_iter iter;
|
|
|
|
|
2014-03-18 06:13:26 +08:00
|
|
|
for_each_key_filter(&b->keys, k, &iter, bch_ptr_invalid)
|
|
|
|
bch_initial_mark_key(b->c, b->level, k);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2014-03-18 06:13:26 +08:00
|
|
|
bch_initial_mark_key(b->c, b->level + 1, &b->key);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
if (b->level) {
|
2013-11-12 09:35:24 +08:00
|
|
|
bch_btree_iter_init(&b->keys, &iter, NULL);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 08:18:59 +08:00
|
|
|
do {
|
2013-12-21 09:28:16 +08:00
|
|
|
k = bch_btree_iter_next_filter(&iter, &b->keys,
|
|
|
|
bch_ptr_bad);
|
2013-09-11 08:18:59 +08:00
|
|
|
if (k)
|
2014-07-12 15:22:53 +08:00
|
|
|
btree_node_prefetch(b, k);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 08:18:59 +08:00
|
|
|
if (p)
|
2014-03-18 06:13:26 +08:00
|
|
|
ret = btree(check_recurse, p, b, op);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 08:18:59 +08:00
|
|
|
p = k;
|
|
|
|
} while (p && !ret);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2014-03-18 06:13:26 +08:00
|
|
|
return ret;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
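
The do/while loop in bch_btree_check_recurse() overlaps I/O with processing: it issues a prefetch for the next child key k before recursing into the previous one, p. The sketch below reproduces only that one-ahead ordering over an array of integer ids and records the sequence of events. walk_children, next_id and struct walk_log are hypothetical names, and ids are assumed to be positive so that a negated id can mark a prefetch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_EVENTS 16

struct walk_log {
	int events[MAX_EVENTS];	/* negative = prefetch issued, positive = node visited */
	size_t n;
};

void log_event(struct walk_log *log, int ev)
{
	assert(log->n < MAX_EVENTS);
	log->events[log->n++] = ev;
}

/* Return the next id, or NULL once all n ids have been handed out. */
const int *next_id(const int *ids, size_t n, size_t *pos)
{
	return *pos < n ? &ids[(*pos)++] : NULL;
}

int walk_children(const int *ids, size_t n, struct walk_log *log)
{
	const int *k, *p = NULL;
	size_t pos = 0;
	int ret = 0;	/* always 0 here; kept to mirror the original loop shape */

	do {
		k = next_id(ids, n, &pos);
		if (k)
			log_event(log, -*k);	/* start reading the next child */
		if (p)
			log_event(log, *p);	/* visit the child prefetched last round */
		p = k;
	} while (p && !ret);

	return ret;
}
```

For ids {1, 2, 3} the recorded order is prefetch 1, prefetch 2, visit 1, prefetch 3, visit 2, visit 3, so each visit after the first runs while the next read is already in flight.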
|
|
|
|
|
2013-07-25 08:44:17 +08:00
|
|
|
int bch_btree_check(struct cache_set *c)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2013-07-25 08:44:17 +08:00
|
|
|
struct btree_op op;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 09:04:18 +08:00
|
|
|
bch_btree_op_init(&op, SHRT_MAX);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2014-03-18 06:13:26 +08:00
|
|
|
return btree_root(check_recurse, c, &op);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2014-03-18 07:55:55 +08:00
|
|
|
void bch_initial_gc_finish(struct cache_set *c)
|
|
|
|
{
|
|
|
|
struct cache *ca;
|
|
|
|
struct bucket *b;
|
|
|
|
unsigned i;
|
|
|
|
|
|
|
|
bch_btree_gc_finish(c);
|
|
|
|
|
|
|
|
mutex_lock(&c->bucket_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We need to put some unused buckets directly on the prio freelist in
|
|
|
|
* order to get the allocator thread started - it needs freed buckets in
|
|
|
|
* order to rewrite the prios and gens, and it needs to rewrite prios
|
|
|
|
* and gens in order to free buckets.
|
|
|
|
*
|
|
|
|
* This is only safe for buckets that have no live data in them, which
|
|
|
|
* there should always be some of.
|
|
|
|
*/
|
|
|
|
for_each_cache(ca, c, i) {
|
|
|
|
for_each_bucket(b, ca) {
|
bcache: fix for allocator and register thread race
After long time running of random small IO writing,
I reboot the machine, and after the machine power on,
I found bcache got stuck, the stack is:
[root@ceph153 ~]# cat /proc/2510/task/*/stack
[<ffffffffa06b2455>] closure_sync+0x25/0x90 [bcache]
[<ffffffffa06b6be8>] bch_journal+0x118/0x2b0 [bcache]
[<ffffffffa06b6dc7>] bch_journal_meta+0x47/0x70 [bcache]
[<ffffffffa06be8f7>] bch_prio_write+0x237/0x340 [bcache]
[<ffffffffa06a8018>] bch_allocator_thread+0x3c8/0x3d0 [bcache]
[<ffffffff810a631f>] kthread+0xcf/0xe0
[<ffffffff8164c318>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff
[root@ceph153 ~]# cat /proc/2038/task/*/stack
[<ffffffffa06b1abd>] __bch_btree_map_nodes+0x12d/0x150 [bcache]
[<ffffffffa06b1bd1>] bch_btree_insert+0xf1/0x170 [bcache]
[<ffffffffa06b637f>] bch_journal_replay+0x13f/0x230 [bcache]
[<ffffffffa06c75fe>] run_cache_set+0x79a/0x7c2 [bcache]
[<ffffffffa06c0cf8>] register_bcache+0xd48/0x1310 [bcache]
[<ffffffff812f702f>] kobj_attr_store+0xf/0x20
[<ffffffff8125b216>] sysfs_write_file+0xc6/0x140
[<ffffffff811dfbfd>] vfs_write+0xbd/0x1e0
[<ffffffff811e069f>] SyS_write+0x7f/0xe0
[<ffffffff8164c3c9>] system_call_fastpath+0x16/0x1
The stack shows the register thread and allocator thread
were getting stuck when registering cache device.
I rebooted the machine several times, and the issue always
exists on this machine.
I debugged the code and found the call trace below:
register_bcache()
==>run_cache_set()
==>bch_journal_replay()
==>bch_btree_insert()
==>__bch_btree_map_nodes()
==>btree_insert_fn()
==>btree_split() //node need split
==>btree_check_reserve()
In btree_check_reserve(), It will check if there is enough buckets
of RESERVE_BTREE type, since allocator thread did not work yet, so
no buckets of RESERVE_BTREE type allocated, so the register thread
waits on c->btree_cache_wait, and goes to sleep.
Then the allocator thread is initialized; its call trace is below:
bch_allocator_thread()
==>bch_prio_write()
==>bch_journal_meta()
==>bch_journal()
==>journal_wait_for_write()
In journal_wait_for_write(), It will check if journal is full by
journal_full(), but the long time random small IO writing
causes the exhaustion of journal buckets(journal.blocks_free=0),
In order to release the journal buckets,
the allocator calls btree_flush_write() to flush keys to
btree nodes, and waits on c->journal.wait until btree nodes writing
over or there has already some journal buckets space, then the
allocator thread goes to sleep. but in btree_flush_write(), since
bch_journal_replay() is not finished, so no btree nodes have journal
(condition "if (btree_current_write(b)->journal)" never satisfied),
so we got no btree node to flush, no journal bucket released,
and allocator sleep all the times.
Through the above analysis, we can see that:
1) The register thread waits for the allocator thread to allocate buckets of
RESERVE_BTREE type;
2) The allocator thread waits for the register thread to replay the journal, so it
can flush btree nodes and get a journal bucket.
So they both get stuck waiting for each other.
Hua Rui provided a patch for me, by allocating some buckets of
RESERVE_BTREE type in advance, so the register thread can get bucket
when btree node splitting and no need to waiting for the allocator
thread. I tested it, it has effect, and register thread run a step
forward, but finally are still got stuck, the reason is only 8 bucket
of RESERVE_BTREE type were allocated, and in bch_journal_replay(),
after 2 btree nodes splitting, only 4 bucket of RESERVE_BTREE type left,
then btree_check_reserve() is not satisfied anymore, so it goes to sleep
again, and at the same time, the allocator thread did not flush enough btree
nodes to release a journal bucket, so they both got stuck again.
So we need to allocate more buckets of RESERVE_BTREE type in advance,
but how much is enough? By experience and test, I think it should be
as much as journal buckets. Then I modify the code as this patch,
and test in the machine, and it works.
This patch modified base on Hua Rui’s patch, and allocate more buckets
of RESERVE_BTREE type in advance to avoid register thread and allocate
thread going to wait for each other.
[patch v2] ca->sb.njournal_buckets would be 0 in the first time after
cache creation, and no journal exists, so just 8 btree buckets is OK.
Signed-off-by: Hua Rui <huarui.dev@gmail.com>
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-08 03:41:43 +08:00
|
|
|
if (fifo_full(&ca->free[RESERVE_PRIO]) &&
|
|
|
|
fifo_full(&ca->free[RESERVE_BTREE]))
|
2014-03-18 07:55:55 +08:00
|
|
|
break;
|
|
|
|
|
|
|
|
if (bch_can_invalidate_bucket(ca, b) &&
|
|
|
|
!GC_MARK(b)) {
|
|
|
|
__bch_invalidate_one_bucket(ca, b);
|
bcache: fix for allocator and register thread race
After long time running of random small IO writing,
I reboot the machine, and after the machine power on,
I found bcache got stuck, the stack is:
[root@ceph153 ~]# cat /proc/2510/task/*/stack
[<ffffffffa06b2455>] closure_sync+0x25/0x90 [bcache]
[<ffffffffa06b6be8>] bch_journal+0x118/0x2b0 [bcache]
[<ffffffffa06b6dc7>] bch_journal_meta+0x47/0x70 [bcache]
[<ffffffffa06be8f7>] bch_prio_write+0x237/0x340 [bcache]
[<ffffffffa06a8018>] bch_allocator_thread+0x3c8/0x3d0 [bcache]
[<ffffffff810a631f>] kthread+0xcf/0xe0
[<ffffffff8164c318>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff
[root@ceph153 ~]# cat /proc/2038/task/*/stack
[<ffffffffa06b1abd>] __bch_btree_map_nodes+0x12d/0x150 [bcache]
[<ffffffffa06b1bd1>] bch_btree_insert+0xf1/0x170 [bcache]
[<ffffffffa06b637f>] bch_journal_replay+0x13f/0x230 [bcache]
[<ffffffffa06c75fe>] run_cache_set+0x79a/0x7c2 [bcache]
[<ffffffffa06c0cf8>] register_bcache+0xd48/0x1310 [bcache]
[<ffffffff812f702f>] kobj_attr_store+0xf/0x20
[<ffffffff8125b216>] sysfs_write_file+0xc6/0x140
[<ffffffff811dfbfd>] vfs_write+0xbd/0x1e0
[<ffffffff811e069f>] SyS_write+0x7f/0xe0
[<ffffffff8164c3c9>] system_call_fastpath+0x16/0x1
The stack shows the register thread and allocator thread
were getting stuck when registering cache device.
I reboot the machine several times, the issue always
exsit in this machine.
I debug the code, and found the call trace as bellow:
register_bcache()
==>run_cache_set()
==>bch_journal_replay()
==>bch_btree_insert()
==>__bch_btree_map_nodes()
==>btree_insert_fn()
==>btree_split() //node need split
==>btree_check_reserve()
In btree_check_reserve(), It will check if there is enough buckets
of RESERVE_BTREE type, since allocator thread did not work yet, so
no buckets of RESERVE_BTREE type allocated, so the register thread
waits on c->btree_cache_wait, and goes to sleep.
Then the allocator thread initialized, the call trace is bellow:
bch_allocator_thread()
==>bch_prio_write()
==>bch_journal_meta()
==>bch_journal()
==>journal_wait_for_write()
In journal_wait_for_write(), It will check if journal is full by
journal_full(), but the long time random small IO writing
causes the exhaustion of journal buckets(journal.blocks_free=0),
In order to release the journal buckets,
the allocator calls btree_flush_write() to flush keys to
btree nodes, and waits on c->journal.wait until btree nodes writing
over or there has already some journal buckets space, then the
allocator thread goes to sleep. but in btree_flush_write(), since
bch_journal_replay() is not finished, so no btree nodes have journal
(condition "if (btree_current_write(b)->journal)" never satisfied),
so we got no btree node to flush, no journal bucket released,
and allocator sleep all the times.
Through the above analysis, we can see that:
1) Register thread wait for allocator thread to allocate buckets of
RESERVE_BTREE type;
2) Alloctor thread wait for register thread to replay journal, so it
can flush btree nodes and get journal bucket.
then they are all got stuck by waiting for each other.
Hua Rui provided a patch for me, by allocating some buckets of
RESERVE_BTREE type in advance, so the register thread can get bucket
when btree node splitting and no need to waiting for the allocator
thread. I tested it, it has effect, and register thread run a step
forward, but finally are still got stuck, the reason is only 8 bucket
of RESERVE_BTREE type were allocated, and in bch_journal_replay(),
after 2 btree nodes splitting, only 4 bucket of RESERVE_BTREE type left,
then btree_check_reserve() is not satisfied anymore, so it goes to sleep
again, and in the same time, alloctor thread did not flush enough btree
nodes to release a journal bucket, so they all got stuck again.
So we need to allocate more buckets of RESERVE_BTREE type in advance,
but how many are enough? From experience and testing, I think it should
be as many as there are journal buckets. I modified the code as in this
patch, tested it on the machine, and it works.
This patch is based on Hua Rui’s patch, and allocates more buckets of
RESERVE_BTREE type in advance to keep the register thread and the
allocator thread from waiting for each other.
[patch v2] ca->sb.njournal_buckets is 0 the first time after cache
creation, and no journal exists yet, so just 8 btree buckets is OK.
Signed-off-by: Hua Rui <huarui.dev@gmail.com>
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
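The sizing rule described in the commit message above (reserve as many RESERVE_BTREE buckets as there are journal buckets, falling back to 8 when no journal exists yet) can be sketched as follows. This is only an illustration of the rule as stated in the message; `btree_reserve_size` and `DEFAULT_BTREE_RESERVE` are hypothetical names, not identifiers taken from the patch.

```c
#include <stddef.h>

/* Hypothetical constant: the fallback count named in the [patch v2] note. */
#define DEFAULT_BTREE_RESERVE 8

/*
 * Hypothetical helper, not the kernel's code: pick how many RESERVE_BTREE
 * buckets to set aside, given the number of journal buckets.
 */
static size_t btree_reserve_size(size_t njournal_buckets)
{
	/* A freshly created cache has no journal buckets yet; use the default. */
	return njournal_buckets ? njournal_buckets : DEFAULT_BTREE_RESERVE;
}
```

With this rule a new cache keeps the old reserve of 8, while a cache with an existing journal gets a reserve that grows with the amount of journal that may need replaying.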
2018-02-08 03:41:43 +08:00
|
|
|
if (!fifo_push(&ca->free[RESERVE_PRIO],
|
|
|
|
b - ca->buckets))
|
|
|
|
fifo_push(&ca->free[RESERVE_BTREE],
|
|
|
|
b - ca->buckets);
|
2014-03-18 07:55:55 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_unlock(&c->bucket_lock);
|
|
|
|
}
|
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
/* Btree insertion */
|
|
|
|
|
2013-11-12 09:02:31 +08:00
|
|
|
static bool btree_insert_key(struct btree *b, struct bkey *k,
|
|
|
|
struct bkey *replace_key)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2013-11-12 09:02:31 +08:00
|
|
|
unsigned status;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
BUG_ON(bkey_cmp(k, &b->key) > 0);
|
2013-11-11 13:55:27 +08:00
|
|
|
|
2013-11-12 09:02:31 +08:00
|
|
|
status = bch_btree_insert_key(&b->keys, k, replace_key);
|
|
|
|
if (status != BTREE_INSERT_STATUS_NO_INSERT) {
|
|
|
|
bch_check_keys(&b->keys, "%u for %s", status,
|
|
|
|
replace_key ? "replace" : "insert");
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-11-12 09:02:31 +08:00
|
|
|
trace_bcache_btree_insert_key(b, k, replace_key != NULL,
|
|
|
|
status);
|
|
|
|
return true;
|
|
|
|
} else
|
|
|
|
return false;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2013-11-12 11:03:54 +08:00
|
|
|
static size_t insert_u64s_remaining(struct btree *b)
|
|
|
|
{
|
2014-01-11 10:53:02 +08:00
|
|
|
long ret = bch_btree_keys_u64s_remaining(&b->keys);
|
2013-11-12 11:03:54 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Might land in the middle of an existing extent and have to split it
|
|
|
|
*/
|
|
|
|
if (b->keys.ops->is_extents)
|
|
|
|
ret -= KEY_MAX_U64S;
|
|
|
|
|
|
|
|
return max(ret, 0L);
|
|
|
|
}
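The clamping in insert_u64s_remaining() can be illustrated with a small standalone sketch. It assumes a made-up value for the per-key maximum (`SKETCH_KEY_MAX_U64S` is hypothetical; the real KEY_MAX_U64S is defined by bcache) and takes the free space as a plain number instead of reading it from a btree node.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for KEY_MAX_U64S; the real value comes from bcache. */
#define SKETCH_KEY_MAX_U64S 8L

/*
 * Sketch of the same arithmetic: start from the free space, hold back room
 * for one extra key when the node stores extents (an insert may split an
 * existing extent in two), and never report a negative amount.
 */
static size_t u64s_remaining(long free_u64s, bool is_extents)
{
	long ret = free_u64s;

	if (is_extents)
		ret -= SKETCH_KEY_MAX_U64S;

	return ret > 0 ? (size_t)ret : 0;
}
```

The intermediate value is signed on purpose: with an unsigned type, the subtraction would wrap around to a huge number instead of going below zero, and the clamp would never trigger.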
|
|
|
|
|
2013-09-11 09:41:15 +08:00
|
|
|
static bool bch_btree_insert_keys(struct btree *b, struct btree_op *op,
|
2013-09-11 09:52:54 +08:00
|
|
|
struct keylist *insert_keys,
|
|
|
|
struct bkey *replace_key)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
bool ret = false;
|
2013-12-18 15:47:33 +08:00
|
|
|
int oldsize = bch_count_data(&b->keys);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 09:41:15 +08:00
|
|
|
while (!bch_keylist_empty(insert_keys)) {
|
2013-07-25 08:24:25 +08:00
|
|
|
struct bkey *k = insert_keys->keys;
|
2013-09-11 09:41:15 +08:00
|
|
|
|
2013-11-12 11:03:54 +08:00
|
|
|
if (bkey_u64s(k) > insert_u64s_remaining(b))
|
2013-07-25 08:22:44 +08:00
|
|
|
break;
|
|
|
|
|
|
|
|
if (bkey_cmp(k, &b->key) <= 0) {
|
2013-07-25 07:46:42 +08:00
|
|
|
if (!b->level)
|
|
|
|
bkey_put(b->c, k);
|
2013-09-11 09:41:15 +08:00
|
|
|
|
2013-11-12 09:02:31 +08:00
|
|
|
ret |= btree_insert_key(b, k, replace_key);
|
2013-09-11 09:41:15 +08:00
|
|
|
bch_keylist_pop_front(insert_keys);
|
|
|
|
} else if (bkey_cmp(&START_KEY(k), &b->key) < 0) {
|
|
|
|
BKEY_PADDED(key) temp;
|
2013-07-25 08:24:25 +08:00
|
|
|
bkey_copy(&temp.key, insert_keys->keys);
|
2013-09-11 09:41:15 +08:00
|
|
|
|
|
|
|
bch_cut_back(&b->key, &temp.key);
|
2013-07-25 08:24:25 +08:00
|
|
|
bch_cut_front(&b->key, insert_keys->keys);
|
2013-09-11 09:41:15 +08:00
|
|
|
|
2013-11-12 09:02:31 +08:00
|
|
|
ret |= btree_insert_key(b, &temp.key, replace_key);
|
2013-09-11 09:41:15 +08:00
|
|
|
break;
|
|
|
|
} else {
|
|
|
|
break;
|
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2013-11-12 09:02:31 +08:00
|
|
|
if (!ret)
|
|
|
|
op->insert_collision = true;
|
|
|
|
|
2013-07-25 08:22:44 +08:00
|
|
|
BUG_ON(!bch_keylist_empty(insert_keys) && b->level);
|
|
|
|
|
2013-12-18 15:47:33 +08:00
|
|
|
BUG_ON(bch_count_data(&b->keys) < oldsize);
|
2013-03-24 07:11:31 +08:00
|
|
|
return ret;
|
|
|
|
}
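The three branches in the loop of bch_btree_insert_keys() decide whether a key fits the node whole, straddles the node's end and must be cut in two, or lies entirely beyond the node. A minimal sketch of that decision, modelling keys as half-open ranges, might look like this; `struct extent`, `enum fit` and `fit_extent` are hypothetical names for illustration and do not reproduce bcache's bkey layout.

```c
#include <stddef.h>

struct extent { unsigned long start, end; };	/* half-open: [start, end) */

enum fit { FIT_WHOLE, FIT_SPLIT, FIT_NONE };

/*
 * Decide how extent k relates to a node whose keys end at node_end.
 * On FIT_WHOLE, *inserted is a copy of k.
 * On FIT_SPLIT, *inserted is the front part up to node_end, and k is
 * trimmed so that it starts at node_end (the part left for the next node).
 * On FIT_NONE, nothing is modified.
 */
static enum fit fit_extent(struct extent *k, unsigned long node_end,
			   struct extent *inserted)
{
	if (k->end <= node_end) {
		*inserted = *k;
		return FIT_WHOLE;
	}
	if (k->start < node_end) {
		inserted->start = k->start;
		inserted->end = node_end;	/* trim the back of the copy */
		k->start = node_end;		/* trim the front of the original */
		return FIT_SPLIT;
	}
	return FIT_NONE;
}
```

In the sketch, as in the function above, a split ends the work for the current node: the trimmed remainder stays with the caller to be inserted elsewhere.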
|
|
|
|
|
2013-09-11 09:41:15 +08:00
|
|
|
static int btree_split(struct btree *b, struct btree_op *op,
|
|
|
|
struct keylist *insert_keys,
|
2013-09-11 09:52:54 +08:00
|
|
|
struct bkey *replace_key)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2013-07-25 08:20:19 +08:00
|
|
|
bool split;
|
2013-03-24 07:11:31 +08:00
|
|
|
struct btree *n1, *n2 = NULL, *n3 = NULL;
|
|
|
|
uint64_t start_time = local_clock();
|
2013-07-25 09:04:18 +08:00
|
|
|
struct closure cl;
|
2013-07-27 03:32:38 +08:00
|
|
|
struct keylist parent_keys;
|
2013-07-25 09:04:18 +08:00
|
|
|
|
|
|
|
closure_init_stack(&cl);
|
2013-07-27 03:32:38 +08:00
|
|
|
bch_keylist_init(&parent_keys);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
if (btree_check_reserve(b, op)) {
|
|
|
|
if (!b->level)
|
|
|
|
return -EINTR;
|
|
|
|
else
|
|
|
|
WARN(1, "insufficient reserve for split\n");
|
|
|
|
}
|
2013-12-17 17:29:34 +08:00
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
n1 = btree_node_alloc_replacement(b, op);
|
2013-03-24 07:11:31 +08:00
|
|
|
if (IS_ERR(n1))
|
|
|
|
goto err;
|
|
|
|
|
2013-12-18 15:49:49 +08:00
|
|
|
split = set_blocks(btree_bset_first(n1),
|
|
|
|
block_bytes(n1->c)) > (btree_blocks(b) * 4) / 5;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
if (split) {
|
|
|
|
unsigned keys = 0;
|
|
|
|
|
2013-12-18 15:49:49 +08:00
|
|
|
trace_bcache_btree_node_split(b, btree_bset_first(n1)->keys);
|
2013-04-27 06:39:55 +08:00
|
|
|
|
2014-07-12 15:22:53 +08:00
|
|
|
n2 = bch_btree_node_alloc(b->c, op, b->level, b->parent);
|
2013-03-24 07:11:31 +08:00
|
|
|
if (IS_ERR(n2))
|
|
|
|
goto err_free1;
|
|
|
|
|
2013-07-25 08:20:19 +08:00
|
|
|
if (!b->parent) {
|
2014-07-12 15:22:53 +08:00
|
|
|
n3 = bch_btree_node_alloc(b->c, op, b->level + 1, NULL);
|
2013-03-24 07:11:31 +08:00
|
|
|
if (IS_ERR(n3))
|
|
|
|
goto err_free2;
|
|
|
|
}
|
|
|
|
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_lock(&n1->write_lock);
|
|
|
|
mutex_lock(&n2->write_lock);
|
|
|
|
|
2013-09-11 09:52:54 +08:00
|
|
|
bch_btree_insert_keys(n1, op, insert_keys, replace_key);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 08:20:19 +08:00
|
|
|
/*
|
|
|
|
* Has to be a linear search because we don't have an auxiliary
|
2013-03-24 07:11:31 +08:00
|
|
|
* search tree yet
|
|
|
|
*/
|
|
|
|
|
2013-12-18 15:49:49 +08:00
|
|
|
while (keys < (btree_bset_first(n1)->keys * 3) / 5)
|
|
|
|
keys += bkey_u64s(bset_bkey_idx(btree_bset_first(n1),
|
2013-12-18 13:56:21 +08:00
|
|
|
keys));
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-12-18 13:56:21 +08:00
|
|
|
bkey_copy_key(&n1->key,
|
2013-12-18 15:49:49 +08:00
|
|
|
bset_bkey_idx(btree_bset_first(n1), keys));
|
|
|
|
keys += bkey_u64s(bset_bkey_idx(btree_bset_first(n1), keys));
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-12-18 15:49:49 +08:00
|
|
|
btree_bset_first(n2)->keys = btree_bset_first(n1)->keys - keys;
|
|
|
|
btree_bset_first(n1)->keys = keys;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-12-18 15:49:49 +08:00
|
|
|
memcpy(btree_bset_first(n2)->start,
|
|
|
|
bset_bkey_last(btree_bset_first(n1)),
|
|
|
|
btree_bset_first(n2)->keys * sizeof(uint64_t));
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
bkey_copy_key(&n2->key, &b->key);
|
|
|
|
|
2013-07-27 03:32:38 +08:00
|
|
|
bch_keylist_add(&parent_keys, &n2->key);
|
2013-07-25 09:04:18 +08:00
|
|
|
bch_btree_node_write(n2, &cl);
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_unlock(&n2->write_lock);
|
2013-03-24 07:11:31 +08:00
|
|
|
rw_unlock(true, n2);
|
2013-04-27 06:39:55 +08:00
|
|
|
} else {
|
2013-12-18 15:49:49 +08:00
|
|
|
trace_bcache_btree_node_compact(b, btree_bset_first(n1)->keys);
|
2013-04-27 06:39:55 +08:00
|
|
|
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_lock(&n1->write_lock);
|
2013-09-11 09:52:54 +08:00
|
|
|
bch_btree_insert_keys(n1, op, insert_keys, replace_key);
|
2013-04-27 06:39:55 +08:00
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-27 03:32:38 +08:00
|
|
|
bch_keylist_add(&parent_keys, &n1->key);
|
2013-07-25 09:04:18 +08:00
|
|
|
bch_btree_node_write(n1, &cl);
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_unlock(&n1->write_lock);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
if (n3) {
|
2013-07-25 08:20:19 +08:00
|
|
|
/* Depth increases, make a new root */
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_lock(&n3->write_lock);
|
2013-03-24 07:11:31 +08:00
|
|
|
bkey_copy_key(&n3->key, &MAX_KEY);
|
2013-07-27 03:32:38 +08:00
|
|
|
bch_btree_insert_keys(n3, op, &parent_keys, NULL);
|
2013-07-25 09:04:18 +08:00
|
|
|
bch_btree_node_write(n3, &cl);
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_unlock(&n3->write_lock);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 09:04:18 +08:00
|
|
|
closure_sync(&cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
bch_btree_set_root(n3);
|
|
|
|
rw_unlock(true, n3);
|
2013-07-25 08:20:19 +08:00
|
|
|
} else if (!b->parent) {
|
|
|
|
/* Root filled up but didn't need to be split */
|
2013-07-25 09:04:18 +08:00
|
|
|
closure_sync(&cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
bch_btree_set_root(n1);
|
|
|
|
} else {
|
2013-07-27 03:32:38 +08:00
|
|
|
/* Split a non root node */
|
2013-07-25 09:04:18 +08:00
|
|
|
closure_sync(&cl);
|
2013-07-27 03:32:38 +08:00
|
|
|
make_btree_freeing_key(b, parent_keys.top);
|
|
|
|
bch_keylist_push(&parent_keys);
|
|
|
|
|
|
|
|
bch_btree_insert_node(b->parent, op, &parent_keys, NULL, NULL);
|
|
|
|
BUG_ON(!bch_keylist_empty(&parent_keys));
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2014-03-18 09:22:34 +08:00
|
|
|
btree_node_free(b);
|
2013-03-24 07:11:31 +08:00
|
|
|
rw_unlock(true, n1);
|
|
|
|
|
2013-03-29 02:50:55 +08:00
|
|
|
bch_time_stats_update(&b->c->btree_split_time, start_time);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
err_free2:
|
2013-12-17 08:38:49 +08:00
|
|
|
bkey_put(b->c, &n2->key);
|
2013-07-25 08:27:07 +08:00
|
|
|
btree_node_free(n2);
|
2013-03-24 07:11:31 +08:00
|
|
|
rw_unlock(true, n2);
|
|
|
|
err_free1:
|
2013-12-17 08:38:49 +08:00
|
|
|
bkey_put(b->c, &n1->key);
|
2013-07-25 08:27:07 +08:00
|
|
|
btree_node_free(n1);
|
2013-03-24 07:11:31 +08:00
|
|
|
rw_unlock(true, n1);
|
|
|
|
err:
|
2014-03-18 08:15:53 +08:00
|
|
|
WARN(1, "bcache: btree split failed (level %u)", b->level);
|
2013-12-17 08:38:49 +08:00
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
if (n3 == ERR_PTR(-EAGAIN) ||
|
|
|
|
n2 == ERR_PTR(-EAGAIN) ||
|
|
|
|
n1 == ERR_PTR(-EAGAIN))
|
|
|
|
return -EAGAIN;
|
|
|
|
|
|
|
|
return -ENOMEM;
|
|
|
|
}
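The two thresholds used by btree_split() (split only if the rewritten node is more than 4/5 full, and place the split point roughly 3/5 of the way through the keys) can be sketched in isolation. The sketch below models a node as an array of per-key sizes in u64s; `should_split`, `pick_split` and `struct split_point` are hypothetical names, and the range guard is an addition for the sketch, not something the function above contains.

```c
#include <stdbool.h>
#include <stddef.h>

/* Result of the sketch: the pivot key and the size of each half in u64s. */
struct split_point {
	size_t pivot_index;	/* index of the last key kept in the first node */
	size_t n1_u64s;		/* u64s kept in the first node, pivot included */
	size_t n2_u64s;		/* u64s moved to the second node */
};

/* Split only when the rewritten node would be more than 4/5 full. */
static bool should_split(size_t used_blocks, size_t node_blocks)
{
	return used_blocks > (node_blocks * 4) / 5;
}

static struct split_point pick_split(const unsigned *key_u64s, size_t nkeys)
{
	struct split_point sp = { 0, 0, 0 };
	size_t total = 0, off = 0, i = 0;

	for (size_t j = 0; j < nkeys; j++)
		total += key_u64s[j];

	/* Linear walk: skip whole keys until about 3/5 of the u64s are behind us. */
	while (i < nkeys && off < (total * 3) / 5)
		off += key_u64s[i++];

	/* Guard added for the sketch only: keep the pivot inside the array. */
	if (i == nkeys && nkeys > 0) {
		i--;
		off -= key_u64s[i];
	}
	if (nkeys > 0)
		off += key_u64s[i];	/* the pivot key itself stays in the first node */

	sp.pivot_index = i;
	sp.n1_u64s = off;
	sp.n2_u64s = total - off;
	return sp;
}
```

Because keys have variable sizes, the walk has to step key by key; jumping straight to 3/5 of the byte range could land in the middle of a key.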
|
|
|
|
|
2013-09-11 09:41:15 +08:00
|
|
|
static int bch_btree_insert_node(struct btree *b, struct btree_op *op,
|
2013-07-25 08:44:17 +08:00
|
|
|
struct keylist *insert_keys,
|
2013-09-11 09:52:54 +08:00
|
|
|
atomic_t *journal_ref,
|
|
|
|
struct bkey *replace_key)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2014-03-05 08:42:42 +08:00
|
|
|
struct closure cl;
|
|
|
|
|
2013-07-27 03:32:38 +08:00
|
|
|
BUG_ON(b->level && replace_key);
|
|
|
|
|
2014-03-05 08:42:42 +08:00
|
|
|
closure_init_stack(&cl);
|
|
|
|
|
|
|
|
mutex_lock(&b->write_lock);
|
|
|
|
|
|
|
|
if (write_block(b) != btree_bset_last(b) &&
|
|
|
|
b->keys.last_set_unwritten)
|
|
|
|
bch_btree_init_next(b); /* just wrote a set */
|
|
|
|
|
2013-11-12 11:03:54 +08:00
|
|
|
if (bch_keylist_nkeys(insert_keys) > insert_u64s_remaining(b)) {
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_unlock(&b->write_lock);
|
|
|
|
goto split;
|
|
|
|
}
|
2013-12-07 19:57:58 +08:00
|
|
|
|
2014-03-05 08:42:42 +08:00
|
|
|
BUG_ON(write_block(b) != btree_bset_last(b));
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2014-03-05 08:42:42 +08:00
|
|
|
if (bch_btree_insert_keys(b, op, insert_keys, replace_key)) {
|
|
|
|
if (!b->level)
|
|
|
|
bch_btree_leaf_dirty(b, journal_ref);
|
|
|
|
else
|
|
|
|
bch_btree_node_write(b, &cl);
|
|
|
|
}
|
2013-07-27 03:32:38 +08:00
|
|
|
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_unlock(&b->write_lock);
|
|
|
|
|
|
|
|
/* wait for btree node write if necessary, after unlock */
|
|
|
|
closure_sync(&cl);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
split:
|
|
|
|
if (current->bio_list) {
|
|
|
|
op->lock = b->c->root->level + 1;
|
|
|
|
return -EAGAIN;
|
|
|
|
} else if (op->lock <= b->c->root->level) {
|
|
|
|
op->lock = b->c->root->level + 1;
|
|
|
|
return -EINTR;
|
|
|
|
} else {
|
|
|
|
/* Invalidated all iterators */
|
|
|
|
int ret = btree_split(b, op, insert_keys, replace_key);
|
|
|
|
|
|
|
|
if (bch_keylist_empty(insert_keys))
|
|
|
|
return 0;
|
|
|
|
else if (!ret)
|
|
|
|
return -EINTR;
|
|
|
|
return ret;
|
2013-07-27 03:32:38 +08:00
|
|
|
}
|
2013-09-11 09:41:15 +08:00
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 09:39:16 +08:00
|
|
|
int bch_btree_insert_check_key(struct btree *b, struct btree_op *op,
|
|
|
|
struct bkey *check_key)
|
|
|
|
{
|
|
|
|
int ret = -EINTR;
|
|
|
|
uint64_t btree_ptr = b->key.ptr[0];
|
|
|
|
unsigned long seq = b->seq;
|
|
|
|
struct keylist insert;
|
|
|
|
bool upgrade = op->lock == -1;
|
|
|
|
|
|
|
|
bch_keylist_init(&insert);
|
|
|
|
|
|
|
|
if (upgrade) {
|
|
|
|
rw_unlock(false, b);
|
|
|
|
rw_lock(true, b, b->level);
|
|
|
|
|
|
|
|
if (b->key.ptr[0] != btree_ptr ||
|
2015-11-30 09:17:05 +08:00
|
|
|
b->seq != seq + 1) {
|
|
|
|
op->lock = b->level;
|
2013-09-11 09:39:16 +08:00
|
|
|
goto out;
|
2015-11-30 09:17:05 +08:00
|
|
|
}
|
2013-09-11 09:39:16 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
SET_KEY_PTRS(check_key, 1);
|
|
|
|
get_random_bytes(&check_key->ptr[0], sizeof(uint64_t));
|
|
|
|
|
|
|
|
SET_PTR_DEV(check_key, 0, PTR_CHECK_DEV);
|
|
|
|
|
|
|
|
bch_keylist_add(&insert, check_key);
|
|
|
|
|
2013-09-11 09:52:54 +08:00
|
|
|
ret = bch_btree_insert_node(b, op, &insert, NULL, NULL);
|
2013-09-11 09:39:16 +08:00
|
|
|
|
|
|
|
BUG_ON(!ret && !bch_keylist_empty(&insert));
|
|
|
|
out:
|
|
|
|
if (upgrade)
|
|
|
|
downgrade_write(&b->lock);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-07-25 09:07:22 +08:00
|
|
|
struct btree_insert_op {
|
|
|
|
struct btree_op op;
|
|
|
|
struct keylist *keys;
|
|
|
|
atomic_t *journal_ref;
|
|
|
|
struct bkey *replace_key;
|
|
|
|
};
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-11-28 10:31:35 +08:00
|
|
|
static int btree_insert_fn(struct btree_op *b_op, struct btree *b)
|
2013-07-25 09:07:22 +08:00
|
|
|
{
|
|
|
|
struct btree_insert_op *op = container_of(b_op,
|
|
|
|
struct btree_insert_op, op);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 09:07:22 +08:00
|
|
|
int ret = bch_btree_insert_node(b, &op->op, op->keys,
|
|
|
|
op->journal_ref, op->replace_key);
|
|
|
|
if (ret && !bch_keylist_empty(op->keys))
|
|
|
|
return ret;
|
|
|
|
else
|
|
|
|
return MAP_DONE;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
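btree_insert_fn() receives only a pointer to the embedded struct btree_op and recovers the surrounding struct btree_insert_op with container_of(). A standalone sketch of that pattern, using a locally defined macro and made-up structs (`sketch_container_of`, `base_op`, `insert_op` and both functions are hypothetical), could look like this:

```c
#include <stddef.h>

/* Local equivalent of the kernel's container_of(), without its type checks. */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct base_op {
	int lock;
};

struct insert_op {
	struct base_op op;	/* embedded generic part, passed to callbacks */
	int payload;		/* caller-specific data carried alongside it */
};

/* The callback only sees the generic part and steps back to the wrapper. */
static int callback(struct base_op *b_op)
{
	struct insert_op *op = sketch_container_of(b_op, struct insert_op, op);

	return op->payload;
}

/* Build a wrapper on the stack and hand only its embedded member onward. */
static int run_with_payload(int payload)
{
	struct insert_op op = { .op = { .lock = 0 }, .payload = payload };

	return callback(&op.op);
}
```

This lets a generic traversal routine stay unaware of the caller's extra arguments while the callback still reaches them without a separate context pointer.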
|
|
|
|
|
2013-07-25 09:07:22 +08:00
|
|
|
int bch_btree_insert(struct cache_set *c, struct keylist *keys,
|
|
|
|
atomic_t *journal_ref, struct bkey *replace_key)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2013-07-25 09:07:22 +08:00
|
|
|
struct btree_insert_op op;
|
2013-03-24 07:11:31 +08:00
|
|
|
int ret = 0;
|
|
|
|
|
2013-07-25 09:07:22 +08:00
|
|
|
BUG_ON(current->bio_list);
|
2013-09-11 09:46:36 +08:00
|
|
|
BUG_ON(bch_keylist_empty(keys));
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 09:07:22 +08:00
|
|
|
bch_btree_op_init(&op.op, 0);
|
|
|
|
op.keys = keys;
|
|
|
|
op.journal_ref = journal_ref;
|
|
|
|
op.replace_key = replace_key;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 09:07:22 +08:00
|
|
|
while (!ret && !bch_keylist_empty(keys)) {
|
|
|
|
op.op.lock = 0;
|
|
|
|
ret = bch_btree_map_leaf_nodes(&op.op, c,
|
|
|
|
&START_KEY(keys->keys),
|
|
|
|
btree_insert_fn);
|
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 09:07:22 +08:00
|
|
|
if (ret) {
|
|
|
|
struct bkey *k;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 09:07:22 +08:00
|
|
|
pr_err("error %i", ret);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 09:07:22 +08:00
|
|
|
while ((k = bch_keylist_pop(keys)))
|
2013-07-25 07:46:42 +08:00
|
|
|
bkey_put(c, k);
|
2013-07-25 09:07:22 +08:00
|
|
|
} else if (op.op.insert_collision)
|
|
|
|
ret = -ESRCH;
|
2013-07-25 09:06:22 +08:00
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
void bch_btree_set_root(struct btree *b)
|
|
|
|
{
|
|
|
|
unsigned i;
|
2013-06-27 08:25:38 +08:00
|
|
|
struct closure cl;
|
|
|
|
|
|
|
|
closure_init_stack(&cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-04-27 06:39:55 +08:00
|
|
|
trace_bcache_btree_set_root(b);
|
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
BUG_ON(!b->written);
|
|
|
|
|
|
|
|
for (i = 0; i < KEY_PTRS(&b->key); i++)
|
|
|
|
BUG_ON(PTR_BUCKET(b->c, &b->key, i)->prio != BTREE_PRIO);
|
|
|
|
|
|
|
|
mutex_lock(&b->c->bucket_lock);
|
|
|
|
list_del_init(&b->list);
|
|
|
|
mutex_unlock(&b->c->bucket_lock);
|
|
|
|
|
|
|
|
b->c->root = b;
|
|
|
|
|
2013-06-27 08:25:38 +08:00
|
|
|
bch_journal_meta(b->c, &cl);
|
|
|
|
closure_sync(&cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2013-09-11 09:48:51 +08:00
|
|
|
/* Map across nodes or keys */
|
|
|
|
|
|
|
|
static int bch_btree_map_nodes_recurse(struct btree *b, struct btree_op *op,
|
|
|
|
struct bkey *from,
|
|
|
|
btree_map_nodes_fn *fn, int flags)
|
|
|
|
{
|
|
|
|
int ret = MAP_CONTINUE;
|
|
|
|
|
|
|
|
if (b->level) {
|
|
|
|
struct bkey *k;
|
|
|
|
struct btree_iter iter;
|
|
|
|
|
2013-11-12 09:35:24 +08:00
|
|
|
bch_btree_iter_init(&b->keys, &iter, from);
|
2013-09-11 09:48:51 +08:00
|
|
|
|
2013-12-21 09:28:16 +08:00
|
|
|
while ((k = bch_btree_iter_next_filter(&iter, &b->keys,
|
2013-09-11 09:48:51 +08:00
|
|
|
bch_ptr_bad))) {
|
|
|
|
ret = btree(map_nodes_recurse, k, b,
|
|
|
|
op, from, fn, flags);
|
|
|
|
from = NULL;
|
|
|
|
|
|
|
|
if (ret != MAP_CONTINUE)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!b->level || flags == MAP_ALL_NODES)
|
|
|
|
ret = fn(op, b);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
int __bch_btree_map_nodes(struct btree_op *op, struct cache_set *c,
|
|
|
|
struct bkey *from, btree_map_nodes_fn *fn, int flags)
|
|
|
|
{
|
2013-07-25 09:04:18 +08:00
|
|
|
return btree_root(map_nodes_recurse, c, op, from, fn, flags);
|
2013-09-11 09:48:51 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static int bch_btree_map_keys_recurse(struct btree *b, struct btree_op *op,
|
|
|
|
struct bkey *from, btree_map_keys_fn *fn,
|
|
|
|
int flags)
|
|
|
|
{
|
|
|
|
int ret = MAP_CONTINUE;
|
|
|
|
struct bkey *k;
|
|
|
|
struct btree_iter iter;
|
|
|
|
|
2013-11-12 09:35:24 +08:00
|
|
|
bch_btree_iter_init(&b->keys, &iter, from);
|
2013-09-11 09:48:51 +08:00
|
|
|
|
2013-12-21 09:28:16 +08:00
|
|
|
while ((k = bch_btree_iter_next_filter(&iter, &b->keys, bch_ptr_bad))) {
|
2013-09-11 09:48:51 +08:00
|
|
|
ret = !b->level
|
|
|
|
? fn(op, b, k)
|
|
|
|
: btree(map_keys_recurse, k, b, op, from, fn, flags);
|
|
|
|
from = NULL;
|
|
|
|
|
|
|
|
if (ret != MAP_CONTINUE)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!b->level && (flags & MAP_END_KEY))
|
|
|
|
ret = fn(op, b, &KEY(KEY_INODE(&b->key),
|
|
|
|
KEY_OFFSET(&b->key), 0));
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
int bch_btree_map_keys(struct btree_op *op, struct cache_set *c,
|
|
|
|
struct bkey *from, btree_map_keys_fn *fn, int flags)
|
|
|
|
{
|
2013-07-25 09:04:18 +08:00
|
|
|
return btree_root(map_keys_recurse, c, op, from, fn, flags);
|
2013-09-11 09:48:51 +08:00
|
|
|
}
|
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
/* Keybuf code */
|
|
|
|
|
|
|
|
static inline int keybuf_cmp(struct keybuf_key *l, struct keybuf_key *r)
|
|
|
|
{
|
|
|
|
/* Overlapping keys compare equal */
|
|
|
|
if (bkey_cmp(&l->key, &START_KEY(&r->key)) <= 0)
|
|
|
|
return -1;
|
|
|
|
if (bkey_cmp(&START_KEY(&l->key), &r->key) >= 0)
|
|
|
|
return 1;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int keybuf_nonoverlapping_cmp(struct keybuf_key *l,
|
|
|
|
struct keybuf_key *r)
|
|
|
|
{
|
|
|
|
return clamp_t(int64_t, bkey_cmp(&l->key, &r->key), -1, 1);
|
|
|
|
}
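The comment "Overlapping keys compare equal" in keybuf_cmp() describes an ordering on ranges in which any two ranges that share space are treated as the same element, so inserting an overlapping key into the tree is reported as a collision. A minimal sketch with half-open integer ranges (`struct range` and `range_cmp` are hypothetical names; bcache compares bkeys, not plain integers):

```c
#include <stddef.h>

struct range { unsigned long start, end; };	/* half-open: [start, end) */

/*
 * Returns -1 if l lies entirely before r, 1 if l lies entirely after r,
 * and 0 if the two ranges overlap anywhere.
 */
static int range_cmp(const struct range *l, const struct range *r)
{
	if (l->end <= r->start)
		return -1;
	if (l->start >= r->end)
		return 1;
	return 0;
}
```

Ranges that merely touch, such as [0, 10) and [10, 20), do not overlap and therefore still order strictly.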
|
|
|
|
|
2013-09-11 09:48:51 +08:00
|
|
|
struct refill {
|
|
|
|
struct btree_op op;
|
2013-11-01 06:43:22 +08:00
|
|
|
unsigned nr_found;
|
2013-09-11 09:48:51 +08:00
|
|
|
struct keybuf *buf;
|
|
|
|
struct bkey *end;
|
|
|
|
keybuf_pred_fn *pred;
|
|
|
|
};
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 09:48:51 +08:00
|
|
|
static int refill_keybuf_fn(struct btree_op *op, struct btree *b,
|
|
|
|
struct bkey *k)
|
|
|
|
{
|
|
|
|
struct refill *refill = container_of(op, struct refill, op);
|
|
|
|
struct keybuf *buf = refill->buf;
|
|
|
|
int ret = MAP_CONTINUE;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 09:48:51 +08:00
|
|
|
if (bkey_cmp(k, refill->end) >= 0) {
|
|
|
|
ret = MAP_DONE;
|
|
|
|
goto out;
|
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 09:48:51 +08:00
|
|
|
if (!KEY_SIZE(k)) /* end key */
|
|
|
|
goto out;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 09:48:51 +08:00
|
|
|
if (refill->pred(buf, k)) {
|
|
|
|
struct keybuf_key *w;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 09:48:51 +08:00
|
|
|
spin_lock(&buf->lock);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 09:48:51 +08:00
|
|
|
w = array_alloc(&buf->freelist);
|
|
|
|
if (!w) {
|
|
|
|
spin_unlock(&buf->lock);
|
|
|
|
return MAP_DONE;
|
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 09:48:51 +08:00
|
|
|
w->private = NULL;
|
|
|
|
bkey_copy(&w->key, k);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 09:48:51 +08:00
|
|
|
if (RB_INSERT(&buf->keys, w, node, keybuf_cmp))
|
|
|
|
array_free(&buf->freelist, w);
|
2013-11-01 06:43:22 +08:00
|
|
|
else
|
|
|
|
refill->nr_found++;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 09:48:51 +08:00
|
|
|
if (array_freelist_empty(&buf->freelist))
|
|
|
|
ret = MAP_DONE;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 09:48:51 +08:00
|
|
|
spin_unlock(&buf->lock);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
2013-09-11 09:48:51 +08:00
|
|
|
out:
|
|
|
|
buf->last_scanned = *k;
|
|
|
|
return ret;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
void bch_refill_keybuf(struct cache_set *c, struct keybuf *buf,
|
2013-06-05 21:24:39 +08:00
|
|
|
struct bkey *end, keybuf_pred_fn *pred)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
struct bkey start = buf->last_scanned;
|
2013-09-11 09:48:51 +08:00
|
|
|
struct refill refill;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
cond_resched();
|
|
|
|
|
2013-07-25 09:04:18 +08:00
|
|
|
bch_btree_op_init(&refill.op, -1);
|
2013-11-01 06:43:22 +08:00
|
|
|
refill.nr_found = 0;
|
|
|
|
refill.buf = buf;
|
|
|
|
refill.end = end;
|
|
|
|
refill.pred = pred;
|
2013-09-11 09:48:51 +08:00
|
|
|
|
|
|
|
bch_btree_map_keys(&refill.op, c, &buf->last_scanned,
|
|
|
|
refill_keybuf_fn, MAP_END_KEY);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-11-01 06:43:22 +08:00
|
|
|
trace_bcache_keyscan(refill.nr_found,
|
|
|
|
KEY_INODE(&start), KEY_OFFSET(&start),
|
|
|
|
KEY_INODE(&buf->last_scanned),
|
|
|
|
KEY_OFFSET(&buf->last_scanned));
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
spin_lock(&buf->lock);
|
|
|
|
|
|
|
|
if (!RB_EMPTY_ROOT(&buf->keys)) {
|
|
|
|
struct keybuf_key *w;
|
|
|
|
w = RB_FIRST(&buf->keys, struct keybuf_key, node);
|
|
|
|
buf->start = START_KEY(&w->key);
|
|
|
|
|
|
|
|
w = RB_LAST(&buf->keys, struct keybuf_key, node);
|
|
|
|
buf->end = w->key;
|
|
|
|
} else {
|
|
|
|
buf->start = MAX_KEY;
|
|
|
|
buf->end = MAX_KEY;
|
|
|
|
}
|
|
|
|
|
|
|
|
spin_unlock(&buf->lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void __bch_keybuf_del(struct keybuf *buf, struct keybuf_key *w)
|
|
|
|
{
|
|
|
|
rb_erase(&w->node, &buf->keys);
|
|
|
|
array_free(&buf->freelist, w);
|
|
|
|
}
|
|
|
|
|
|
|
|
void bch_keybuf_del(struct keybuf *buf, struct keybuf_key *w)
|
|
|
|
{
|
|
|
|
spin_lock(&buf->lock);
|
|
|
|
__bch_keybuf_del(buf, w);
|
|
|
|
spin_unlock(&buf->lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
bool bch_keybuf_check_overlapping(struct keybuf *buf, struct bkey *start,
|
|
|
|
struct bkey *end)
|
|
|
|
{
|
|
|
|
bool ret = false;
|
|
|
|
struct keybuf_key *p, *w, s;
|
|
|
|
s.key = *start;
|
|
|
|
|
|
|
|
if (bkey_cmp(end, &buf->start) <= 0 ||
|
|
|
|
bkey_cmp(start, &buf->end) >= 0)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
spin_lock(&buf->lock);
|
|
|
|
w = RB_GREATER(&buf->keys, s, node, keybuf_nonoverlapping_cmp);
|
|
|
|
|
|
|
|
while (w && bkey_cmp(&START_KEY(&w->key), end) < 0) {
|
|
|
|
p = w;
|
|
|
|
w = RB_NEXT(w, node);
|
|
|
|
|
|
|
|
if (p->private)
|
|
|
|
ret = true;
|
|
|
|
else
|
|
|
|
__bch_keybuf_del(buf, p);
|
|
|
|
}
|
|
|
|
|
|
|
|
spin_unlock(&buf->lock);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
struct keybuf_key *bch_keybuf_next(struct keybuf *buf)
|
|
|
|
{
|
|
|
|
struct keybuf_key *w;
|
|
|
|
spin_lock(&buf->lock);
|
|
|
|
|
|
|
|
w = RB_FIRST(&buf->keys, struct keybuf_key, node);
|
|
|
|
|
|
|
|
while (w && w->private)
|
|
|
|
w = RB_NEXT(w, node);
|
|
|
|
|
|
|
|
if (w)
|
|
|
|
w->private = ERR_PTR(-EINTR);
|
|
|
|
|
|
|
|
spin_unlock(&buf->lock);
|
|
|
|
return w;
|
|
|
|
}
|
|
|
|
|
|
|
|
struct keybuf_key *bch_keybuf_next_rescan(struct cache_set *c,
|
2013-09-11 09:48:51 +08:00
|
|
|
struct keybuf *buf,
|
|
|
|
struct bkey *end,
|
|
|
|
keybuf_pred_fn *pred)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
struct keybuf_key *ret;
|
|
|
|
|
|
|
|
while (1) {
|
|
|
|
ret = bch_keybuf_next(buf);
|
|
|
|
if (ret)
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (bkey_cmp(&buf->last_scanned, end) >= 0) {
|
|
|
|
pr_debug("scan finished");
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2013-06-05 21:24:39 +08:00
|
|
|
bch_refill_keybuf(c, buf, end, pred);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-06-05 21:24:39 +08:00
|
|
|
void bch_keybuf_init(struct keybuf *buf)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
buf->last_scanned = MAX_KEY;
|
|
|
|
buf->keys = RB_ROOT;
|
|
|
|
|
|
|
|
spin_lock_init(&buf->lock);
|
|
|
|
array_allocator_init(&buf->freelist);
|
|
|
|
}
|