License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0                                               11139
and resulted in the first patch in this series.
If that file was under a */uapi/* path, it was "GPL-2.0 WITH
Linux-syscall-note", otherwise it was "GPL-2.0". The results were:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                         930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                         270
GPL-2.0+ WITH Linux-syscall-note                        169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)      21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)      17
LGPL-2.1+ WITH Linux-syscall-note                        15
GPL-1.0+ WITH Linux-syscall-note                         14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)      5
LGPL-2.0+ WITH Linux-syscall-note                         4
LGPL-2.1 WITH Linux-syscall-note                          3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)               1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were any new insights.
The Windriver scanner is based in part on an older version of FOSSology,
so they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
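As an illustration only, the two decisions described above (the default identifier for a file with no license text, and the comment style per file type) can be sketched in C. This is not the script Thomas and Greg used; the helper names `spdx_default_license`, `is_header` and `spdx_tag_line` are hypothetical, and matching `/uapi/` as a substring is an assumed simplification of the `*/uapi/*` pattern. The comment styles follow the kernel convention of a block comment in headers and a line comment in .c files.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: default identifier for a file with no license text. */
static const char *spdx_default_license(const char *path)
{
	assert(path != NULL);
	/* Assumed simplification: any path containing "/uapi/" is a uapi file. */
	if (strstr(path, "/uapi/"))
		return "GPL-2.0 WITH Linux-syscall-note";
	return "GPL-2.0";
}

/* Hypothetical helper: returns nonzero if path ends in ".h". */
static int is_header(const char *path)
{
	size_t len = strlen(path);

	return len >= 2 && strcmp(path + len - 2, ".h") == 0;
}

/*
 * Write the SPDX tag line for path into buf. Headers get a block comment,
 * other C sources get a line comment. Returns the result of snprintf().
 */
static int spdx_tag_line(const char *path, char *buf, size_t n)
{
	const char *id = spdx_default_license(path);

	assert(buf != NULL && n > 0);
	if (is_header(path))
		return snprintf(buf, n, "/* SPDX-License-Identifier: %s */", id);
	return snprintf(buf, n, "// SPDX-License-Identifier: %s", id);
}
```

Calling `spdx_tag_line()` on a path such as drivers/md/bcache/btree.c yields the line-comment form that appears at the top of that file.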
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
// SPDX-License-Identifier: GPL-2.0
/*
 * Copyright (C) 2010 Kent Overstreet <kent.overstreet@gmail.com>
 *
 * Uses a block device as cache for other block devices; optimized for SSDs.
 * All allocation is done in buckets, which should match the erase block size
 * of the device.
 *
 * Buckets containing cached data are kept on a heap sorted by priority;
 * bucket priority is increased on cache hit, and periodically all the buckets
 * on the heap have their priority scaled down. This currently is just used as
 * an LRU but in the future should allow for more intelligent heuristics.
 *
 * Buckets have an 8 bit counter; freeing is accomplished by incrementing the
 * counter. Garbage collection is used to remove stale pointers.
 *
 * Indexing is done via a btree; nodes are not necessarily fully sorted, rather
 * as keys are inserted we only sort the pages that have not yet been written.
 * When garbage collection is run, we resort the entire node.
 *
 * All configuration is done via sysfs; see Documentation/admin-guide/bcache.rst.
 */

#include "bcache.h"
#include "btree.h"
#include "debug.h"
#include "extents.h"

#include <linux/slab.h>
#include <linux/bitops.h>
#include <linux/hash.h>
#include <linux/kthread.h>
#include <linux/prefetch.h>
#include <linux/random.h>
#include <linux/rcupdate.h>
#include <linux/sched/clock.h>
#include <linux/rculist.h>
bcache: fix race in btree_flush_write()
There is a race between mca_reap(), btree_node_free() and the journal code
btree_flush_write(), which results in very rare and strange deadlocks or
panics that are very hard to reproduce.
Let me explain how the race happens. In btree_flush_write() the btree
node with the oldest journal pin is selected, then it is flushed to the
cache device; the select-and-flush is a two-step operation. Between these
two steps, several things may happen inside the race window:
- The selected btree node was reaped by mca_reap() and allocated to
another requester for another btree node.
- The selected btree node was selected, flushed and released by the mca
shrink callback bch_mca_scan().
When btree_flush_write() tries to flush the selected btree node, first
b->write_lock is taken by mutex_lock(). If the race happens and the
memory of the selected btree node has been allocated to another btree
node whose write_lock is already held, a deadlock very probably
happens here. A worse case is that the memory of the selected btree node
has been released; then any reference to this btree node (e.g.
b->write_lock) will trigger a NULL pointer dereference panic.
This race was introduced in commit cafe56359144 ("bcache: A block layer
cache"), and enlarged by commit c4dc2497d50d ("bcache: fix high CPU
occupancy during journal"), which selected 128 btree nodes and flushed
them one-by-one in a quite long time period.
Such a race was not easy to reproduce before. On a Lenovo SR650 server with
48 Xeon cores, configured with 1 NVMe SSD as the cache device and an MD
raid0 device assembled from 3 NVMe SSDs as the backing device, this race
can be observed around once every 10,000 calls to btree_flush_write(). Both
deadlock and kernel panic happened as an aftermath of the race.
The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
is set when selecting btree nodes, and cleared after the btree nodes are
flushed. Then when mca_reap() selects a btree node with this bit set,
this btree node will be skipped. Since mca_reap() only reaps btree nodes
without the BTREE_NODE_journal_flush flag, such a race is avoided.
One corner case should be noted, that is btree_node_free(). It might
be called in some error handling code path. For example the following
code piece from btree_split(),
2149 err_free2:
2150 bkey_put(b->c, &n2->key);
2151 btree_node_free(n2);
2152 rw_unlock(true, n2);
2153 err_free1:
2154 bkey_put(b->c, &n1->key);
2155 btree_node_free(n1);
2156 rw_unlock(true, n1);
At lines 2151 and 2155, the btree nodes n2 and n1 are released without
mca_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
If btree_node_free() is called directly in such an error handling path,
and the selected btree node has the BTREE_NODE_journal_flush bit set, just
delay for 1 us and retry. In this case the btree node won't
be skipped; just retry until the BTREE_NODE_journal_flush bit is cleared,
then free the btree node memory.
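The flag protocol described above can be sketched in userspace C as follows. This is a simplified illustration, not the kernel patch: `struct node`, `NODE_JOURNAL_FLUSH` and all function names are hypothetical stand-ins, the delay is a caller-supplied callback instead of a 1 us kernel delay, and the locking that makes check-and-act atomic in the kernel (b->write_lock) is deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NODE_JOURNAL_FLUSH 1u	/* stand-in for BTREE_NODE_journal_flush */

struct node {
	atomic_uint flags;
	bool reaped;
	bool freed;
};

static void node_init(struct node *n)
{
	atomic_init(&n->flags, 0);
	n->reaped = false;
	n->freed = false;
}

/* Journal side: mark the node when it is selected for flushing. */
static void journal_select(struct node *n)
{
	atomic_fetch_or(&n->flags, NODE_JOURNAL_FLUSH);
}

/* Journal side: clear the mark once the flush has finished. */
static void journal_flush_done(struct node *n)
{
	atomic_fetch_and(&n->flags, ~NODE_JOURNAL_FLUSH);
}

/* Reaper side: a marked node is skipped and never reused. */
static bool try_reap(struct node *n)
{
	if (atomic_load(&n->flags) & NODE_JOURNAL_FLUSH)
		return false;
	n->reaped = true;
	return true;
}

/*
 * Error-path free: the node cannot be skipped, so wait until the mark is
 * cleared, then free. Returns the number of retries that were needed.
 */
static unsigned int node_free(struct node *n, void (*wait_1us)(void *),
			      void *ctx)
{
	unsigned int retries = 0;

	assert(n != NULL && wait_1us != NULL);
	while (atomic_load(&n->flags) & NODE_JOURNAL_FLUSH) {
		wait_1us(ctx);
		retries++;
	}
	n->freed = true;
	return retries;
}

/* Demo delay callback: pretends the journal finishes its flush meanwhile. */
static void flush_completes(void *ctx)
{
	journal_flush_done(ctx);
}
```

The asymmetry is the point of the design: the reaper has other candidates and can simply skip a marked node, while the error path must free that particular node and therefore has to wait.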
Fixes: cafe56359144 ("bcache: A block layer cache")
Signed-off-by: Coly Li <colyli@suse.de>
Reported-and-tested-by: kbuild test robot <lkp@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 19:59:58 +08:00
#include <linux/delay.h>
#include <trace/events/bcache.h>

/*
 * Todo:
 * register_bcache: Return errors out to userspace correctly
 *
 * Writeback: don't undirty key until after a cache flush
 *
 * Create an iterator for key pointers
 *
 * On btree write error, mark bucket such that it won't be freed from the cache
 *
 * Journalling:
 *   Check for bad keys in replay
 *   Propagate barriers
 *   Refcount journal entries in journal_replay
 *
 * Garbage collection:
 *   Finish incremental gc
 *   Gc should free old UUIDs, data for invalid UUIDs
 *
 * Provide a way to list backing device UUIDs we have data cached for, and
 * probably how long it's been since we've seen them, and a way to invalidate
 * dirty data for devices that will never be attached again
 *
 * Keep 1 min/5 min/15 min statistics of how busy a block device has been, so
 * that based on that and how much dirty data we have we can keep writeback
 * from being starved
 *
 * Add a tracepoint or somesuch to watch for writeback starvation
 *
 * When btree depth > 1 and splitting an interior node, we have to make sure
 * alloc_bucket() cannot fail. This should be true but is not completely
 * obvious.
 *
 * Plugging?
 *
 * If data write is less than hard sector size of ssd, round up offset in open
 * bucket to the next whole sector
 *
 * Superblock needs to be fleshed out for multiple cache devices
 *
 * Add a sysfs tunable for the number of writeback IOs in flight
 *
 * Add a sysfs tunable for the number of open data buckets
 *
 * IO tracking: Can we track when one process is doing io on behalf of another?
 * IO tracking: Don't use just an average, weigh more recent stuff higher
 *
 * Test module load/unload
 */

#define MAX_NEED_GC		64
#define MAX_SAVE_PRIO		72
bcache: calculate the number of incremental GC nodes according to the total of btree nodes
This patch is based on "[PATCH] bcache: finish incremental GC".
Since incremental GC stops for 100ms when front side I/O comes, when
there are many btree nodes and GC only processes a constant number (100)
of nodes each time, GC would last a long time, the front side I/Os would
run out of buckets (since no new bucket can be allocated during GC), and
the I/Os would be blocked again.
So GC should not process a constant number of nodes, but a number that
varies with the number of btree nodes. In this patch, GC is divided into a
constant number (100) of rounds, so when there are many btree nodes, GC can
process more nodes each time, otherwise GC will process fewer nodes each
time (but no fewer than MIN_GC_NODES).
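The per-round node count described above amounts to a division with a lower bound. A minimal sketch, assuming the two constants keep the values given in this file; the function name `gc_nodes_per_round` is hypothetical and the kernel takes the total from its own GC statistics rather than from a parameter:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GC_TIMES	100	/* GC is split into this many rounds */
#define MIN_GC_NODES	100	/* lower bound on nodes per round */

static_assert(MAX_GC_TIMES > 0, "MAX_GC_TIMES is used as a divisor");

/* Hypothetical helper: nodes to process in one GC round. */
static size_t gc_nodes_per_round(size_t total_nodes)
{
	size_t nodes = total_nodes / MAX_GC_TIMES;

	/* Small trees still make progress in reasonably sized batches. */
	return nodes < MIN_GC_NODES ? MIN_GC_NODES : nodes;
}
```

With these constants the lower bound applies to any tree of fewer than 10,000 nodes; larger trees get proportionally larger rounds.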
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-26 12:17:35 +08:00
#define MAX_GC_TIMES		100
bcache: finish incremental GC
In the GC thread, we record the latest GC key in gc_done, which is expected
to be used for incremental GC, but the current code does not implement
it. When GC runs, front side I/O is blocked until GC is over, which
can take a long time if there are a lot of btree nodes.
This patch implements incremental GC. The main idea is that, when there
are front side I/Os, after GC has processed some nodes (100), we stop GC,
release the lock of the btree node, and go to process the front side I/Os
for some time (100 ms), then go back to GC again.
With this patch, when we are doing GC, I/Os are not blocked all the time,
and there is no longer an obvious problem of I/O dropping to zero.
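The stop-and-resume loop described above can be sketched as follows. This is a userspace illustration under stated assumptions, not the kernel implementation: `struct gc_hooks`, `gc_incremental` and the two demo callbacks are hypothetical, node collection is reduced to a counter, and releasing the btree node lock before sleeping is only indicated by a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GC_SLEEP_MS 100

/* Hypothetical hooks standing in for kernel facilities. */
struct gc_hooks {
	bool (*io_pending)(void *ctx);			/* front side I/O waiting? */
	void (*sleep_ms)(unsigned int ms, void *ctx);	/* yield to that I/O */
	void *ctx;
};

/* Process total_nodes in batches; returns how many times GC paused. */
static size_t gc_incremental(size_t total_nodes, size_t batch,
			     const struct gc_hooks *h)
{
	size_t done = 0, pauses = 0;

	assert(batch > 0 && h && h->io_pending && h->sleep_ms);
	while (done < total_nodes) {
		size_t left = total_nodes - done;

		done += left < batch ? left : batch;	/* "collect" one batch */

		/* Only pause if work remains and front side I/O is waiting. */
		if (done < total_nodes && h->io_pending(h->ctx)) {
			/* The kernel would drop the btree node lock here. */
			h->sleep_ms(GC_SLEEP_MS, h->ctx);
			pauses++;
		}
	}
	return pauses;
}

/* Demo hooks: I/O is always waiting; sleeping just accumulates the time. */
static bool demo_io_busy(void *ctx)
{
	(void)ctx;
	return true;
}

static void demo_sleep(unsigned int ms, void *ctx)
{
	*(unsigned int *)ctx += ms;
}
```

Checking for pending I/O only between batches keeps each batch uninterrupted, so GC always makes forward progress even under constant front side load.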
Patch v2: Rename some variables and macros name as Coly suggested.
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-26 12:17:34 +08:00
#define MIN_GC_NODES		100
#define GC_SLEEP_MS		100

#define PTR_DIRTY_BIT		(((uint64_t) 1 << 36))

#define PTR_HASH(c, k)							\
	(((k)->ptr[0] >> c->bucket_bits) | PTR_GEN(k, 0))

#define insert_lock(s, b)	((b)->level <= (s)->lock)

/*
 * These macros are for recursing down the btree - they handle the details of
 * locking and looking up nodes in the cache for you. They're best treated as
 * mere syntax when reading code that uses them.
 *
 * op->lock determines whether we take a read or a write lock at a given depth.
 * If you've got a read lock and find that you need a write lock (i.e. you're
 * going to have to split), set op->lock and return -EINTR; btree_root() will
 * call you again and you'll have the correct lock.
 */

/**
 * btree - recurse down the btree on a specified key
 * @fn:		function to call, which will be passed the child node
 * @key:	key to recurse on
 * @b:		parent btree node
 * @op:		pointer to struct btree_op
 */
#define btree(fn, key, b, op, ...)					\
({									\
	int _r, l = (b)->level - 1;					\
	bool _w = l <= (op)->lock;					\
	struct btree *_child = bch_btree_node_get((b)->c, op, key, l,	\
						  _w, b);		\
	if (!IS_ERR(_child)) {						\
		_r = bch_btree_ ## fn(_child, op, ##__VA_ARGS__);	\
		rw_unlock(_w, _child);					\
	} else								\
		_r = PTR_ERR(_child);					\
	_r;								\
})

/**
 * btree_root - call a function on the root of the btree
 * @fn:		function to call, which will be passed the child node
 * @c:		cache set
 * @op:		pointer to struct btree_op
 */
#define btree_root(fn, c, op, ...)					\
({									\
	int _r = -EINTR;						\
	do {								\
		struct btree *_b = (c)->root;				\
		bool _w = insert_lock(op, _b);				\
		rw_lock(_w, _b, _b->level);				\
		if (_b == (c)->root &&					\
		    _w == insert_lock(op, _b)) {			\
			_r = bch_btree_ ## fn(_b, op, ##__VA_ARGS__);	\
		}							\
		rw_unlock(_w, _b);					\
		bch_cannibalize_unlock(c);				\
		if (_r == -EINTR)					\
			schedule();					\
	} while (_r == -EINTR);						\
									\
	finish_wait(&(c)->btree_cache_wait, &(op)->wait);		\
	_r;								\
})

static inline struct bset *write_block(struct btree *b)
{
	return ((void *) btree_bset_first(b)) + b->written * block_bytes(b->c);
}

static void bch_btree_init_next(struct btree *b)
{
	/* If not a leaf node, always sort */
	if (b->level && b->keys.nsets)
		bch_btree_sort(&b->keys, &b->c->sort);
	else
		bch_btree_sort_lazy(&b->keys, &b->c->sort);

	if (b->written < btree_blocks(b))
		bch_bset_init_next(&b->keys, write_block(b),
				   bset_magic(&b->c->sb));
}

/* Btree key manipulation */

void bkey_put(struct cache_set *c, struct bkey *k)
{
	unsigned int i;

	for (i = 0; i < KEY_PTRS(k); i++)
		if (ptr_available(c, k, i))
			atomic_dec_bug(&PTR_BUCKET(c, k, i)->pin);
}

/* Btree IO */

static uint64_t btree_csum_set(struct btree *b, struct bset *i)
{
	uint64_t crc = b->key.ptr[0];
	void *data = (void *) i + 8, *end = bset_bkey_last(i);

	crc = bch_crc64_update(crc, data, end - data);
	return crc ^ 0xffffffffffffffffULL;
}

void bch_btree_node_read_done(struct btree *b)
{
	const char *err = "bad btree header";
	struct bset *i = btree_bset_first(b);
	struct btree_iter *iter;

	/*
	 * c->fill_iter can allocate an iterator with more memory space
	 * than static MAX_BSETS.
	 * See the comment around cache_set->fill_iter.
	 */
	iter = mempool_alloc(&b->c->fill_iter, GFP_NOIO);
	iter->size = b->c->sb.bucket_size / b->c->sb.block_size;
	iter->used = 0;

#ifdef CONFIG_BCACHE_DEBUG
	iter->b = &b->keys;
#endif

	if (!i->seq)
		goto err;

	for (;
	     b->written < btree_blocks(b) && i->seq == b->keys.set[0].data->seq;
	     i = write_block(b)) {
		err = "unsupported bset version";
		if (i->version > BCACHE_BSET_VERSION)
			goto err;

		err = "bad btree header";
		if (b->written + set_blocks(i, block_bytes(b->c)) >
		    btree_blocks(b))
			goto err;

		err = "bad magic";
		if (i->magic != bset_magic(&b->c->sb))
			goto err;

		err = "bad checksum";
		switch (i->version) {
		case 0:
			if (i->csum != csum_set(i))
				goto err;
			break;
		case BCACHE_BSET_VERSION:
			if (i->csum != btree_csum_set(b, i))
				goto err;
			break;
		}

		err = "empty set";
		if (i != b->keys.set[0].data && !i->keys)
			goto err;

		bch_btree_iter_push(iter, i->start, bset_bkey_last(i));

		b->written += set_blocks(i, block_bytes(b->c));
	}

	err = "corrupted btree";
	for (i = write_block(b);
	     bset_sector_offset(&b->keys, i) < KEY_SIZE(&b->key);
	     i = ((void *) i) + block_bytes(b->c))
		if (i->seq == b->keys.set[0].data->seq)
			goto err;

	bch_btree_sort_and_fix_extents(&b->keys, iter, &b->c->sort);

	i = b->keys.set[0].data;
	err = "short btree key";
	if (b->keys.set[0].size &&
	    bkey_cmp(&b->key, &b->keys.set[0].end) < 0)
		goto err;

	if (b->written < btree_blocks(b))
		bch_bset_init_next(&b->keys, write_block(b),
				   bset_magic(&b->c->sb));
out:
	mempool_free(iter, &b->c->fill_iter);
	return;
err:
	set_btree_node_io_error(b);
	bch_cache_set_error(b->c, "%s at bucket %zu, block %u, %u keys",
			    err, PTR_BUCKET_NR(b->c, &b->key, 0),
			    bset_block_offset(b, i), i->keys);
	goto out;
}

static void btree_node_read_endio(struct bio *bio)
{
	struct closure *cl = bio->bi_private;

	closure_put(cl);
}

static void bch_btree_node_read(struct btree *b)
{
	uint64_t start_time = local_clock();
	struct closure cl;
	struct bio *bio;

	trace_bcache_btree_read(b);

	closure_init_stack(&cl);

	bio = bch_bbio_alloc(b->c);
	bio->bi_iter.bi_size = KEY_SIZE(&b->key) << 9;
	bio->bi_end_io = btree_node_read_endio;
	bio->bi_private = &cl;
	bio->bi_opf = REQ_OP_READ | REQ_META;

	bch_bio_map(bio, b->keys.set[0].data);

	bch_submit_bbio(bio, b->c, &b->key, 0);
	closure_sync(&cl);

	if (bio->bi_status)
		set_btree_node_io_error(b);

	bch_bbio_free(bio, b->c);

	if (btree_node_io_error(b))
		goto err;

	bch_btree_node_read_done(b);
	bch_time_stats_update(&b->c->btree_read_time, start_time);

	return;
err:
	bch_cache_set_error(b->c, "io error reading bucket %zu",
			    PTR_BUCKET_NR(b->c, &b->key, 0));
}

static void btree_complete_write(struct btree *b, struct btree_write *w)
{
	if (w->prio_blocked &&
	    !atomic_sub_return(w->prio_blocked, &b->c->prio_blocked))
		wake_up_allocators(b->c);

	if (w->journal) {
		atomic_dec_bug(w->journal);
		__closure_wake_up(&b->c->journal.wait);
	}

	w->prio_blocked = 0;
	w->journal = NULL;
}

static void btree_node_write_unlock(struct closure *cl)
{
	struct btree *b = container_of(cl, struct btree, io);

	up(&b->io_mutex);
}

static void __btree_node_write_done(struct closure *cl)
{
	struct btree *b = container_of(cl, struct btree, io);
	struct btree_write *w = btree_prev_write(b);

	bch_bbio_free(b->bio, b->c);
	b->bio = NULL;
	btree_complete_write(b, w);

	if (btree_node_dirty(b))
		schedule_delayed_work(&b->work, 30 * HZ);

	closure_return_with_destructor(cl, btree_node_write_unlock);
}

static void btree_node_write_done(struct closure *cl)
{
	struct btree *b = container_of(cl, struct btree, io);

	bio_free_pages(b->bio);
	__btree_node_write_done(cl);
}

static void btree_node_write_endio(struct bio *bio)
{
	struct closure *cl = bio->bi_private;
	struct btree *b = container_of(cl, struct btree, io);

	if (bio->bi_status)
		set_btree_node_io_error(b);

	bch_bbio_count_io_errors(b->c, bio, bio->bi_status, "writing btree");
	closure_put(cl);
}

static void do_btree_node_write(struct btree *b)
{
	struct closure *cl = &b->io;
	struct bset *i = btree_bset_last(b);
	BKEY_PADDED(key) k;

	i->version = BCACHE_BSET_VERSION;
	i->csum = btree_csum_set(b, i);

	BUG_ON(b->bio);
	b->bio = bch_bbio_alloc(b->c);

	b->bio->bi_end_io = btree_node_write_endio;
	b->bio->bi_private = cl;
	b->bio->bi_iter.bi_size = roundup(set_bytes(i), block_bytes(b->c));
	b->bio->bi_opf = REQ_OP_WRITE | REQ_META | REQ_FUA;
	bch_bio_map(b->bio, i);

	/*
	 * If we're appending to a leaf node, we don't technically need FUA -
	 * this write just needs to be persisted before the next journal write,
	 * which will be marked FLUSH|FUA.
	 *
	 * Similarly if we're writing a new btree root - the pointer is going to
	 * be in the next journal entry.
	 *
	 * But if we're writing a new btree node (that isn't a root) or
	 * appending to a non leaf btree node, we need either FUA or a flush
	 * when we write the parent with the new pointer. FUA is cheaper than a
	 * flush, and writes appending to leaf nodes aren't blocking anything so
	 * just make all btree node writes FUA to keep things sane.
	 */

	bkey_copy(&k.key, &b->key);
	SET_PTR_OFFSET(&k.key, 0, PTR_OFFSET(&k.key, 0) +
		       bset_sector_offset(&b->keys, i));

	if (!bch_bio_alloc_pages(b->bio, __GFP_NOWARN|GFP_NOWAIT)) {
		struct bio_vec *bv;
		void *addr = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
		struct bvec_iter_all iter_all;

		bio_for_each_segment_all(bv, b->bio, iter_all) {
			memcpy(page_address(bv->bv_page), addr, PAGE_SIZE);
			addr += PAGE_SIZE;
		}

		bch_submit_bbio(b->bio, b->c, &k.key, 0);

		continue_at(cl, btree_node_write_done, NULL);
	} else {
		/*
		 * No problem for multipage bvec since the bio is
		 * just allocated
		 */
		b->bio->bi_vcnt = 0;
		bch_bio_map(b->bio, i);

		bch_submit_bbio(b->bio, b->c, &k.key, 0);

		closure_sync(cl);
		continue_at_nobarrier(cl, __btree_node_write_done, NULL);
	}
}

void __bch_btree_node_write(struct btree *b, struct closure *parent)
{
	struct bset *i = btree_bset_last(b);

	lockdep_assert_held(&b->write_lock);

	trace_bcache_btree_write(b);

	BUG_ON(current->bio_list);
	BUG_ON(b->written >= btree_blocks(b));
	BUG_ON(b->written && !i->keys);
	BUG_ON(btree_bset_first(b)->seq != i->seq);
	bch_check_keys(&b->keys, "writing");

	cancel_delayed_work(&b->work);

	/* If caller isn't waiting for write, parent refcount is cache set */
	down(&b->io_mutex);
	closure_init(&b->io, parent ?: &b->c->cl);

	clear_bit(BTREE_NODE_dirty, &b->flags);
	change_bit(BTREE_NODE_write_idx, &b->flags);

	do_btree_node_write(b);

	atomic_long_add(set_blocks(i, block_bytes(b->c)) * b->c->sb.block_size,
			&PTR_CACHE(b->c, &b->key, 0)->btree_sectors_written);

	b->written += set_blocks(i, block_bytes(b->c));
}

void bch_btree_node_write(struct btree *b, struct closure *parent)
{
	unsigned int nsets = b->keys.nsets;

	lockdep_assert_held(&b->lock);

	__bch_btree_node_write(b, parent);

	/*
	 * do verify if there was more than one set initially (i.e. we did a
	 * sort) and we sorted down to a single set:
	 */
	if (nsets && !b->keys.nsets)
		bch_btree_verify(b);

	bch_btree_init_next(b);
}
|
|
|
static void bch_btree_node_write_sync(struct btree *b)
|
|
|
|
{
|
|
|
|
struct closure cl;
|
|
|
|
|
|
|
|
closure_init_stack(&cl);
|
2014-03-05 08:42:42 +08:00
|
|
|
|
|
|
|
mutex_lock(&b->write_lock);
|
2013-07-24 11:48:29 +08:00
|
|
|
bch_btree_node_write(b, &cl);
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_unlock(&b->write_lock);
|
|
|
|
|
2013-07-24 11:48:29 +08:00
|
|
|
closure_sync(&cl);
|
|
|
|
}
|
|
|
|
|
2013-04-26 04:58:35 +08:00
|
|
|
static void btree_node_write_work(struct work_struct *w)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
struct btree *b = container_of(to_delayed_work(w), struct btree, work);
|
|
|
|
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_lock(&b->write_lock);
|
2013-03-24 07:11:31 +08:00
|
|
|
if (btree_node_dirty(b))
|
2014-03-05 08:42:42 +08:00
|
|
|
__bch_btree_node_write(b, NULL);
|
|
|
|
mutex_unlock(&b->write_lock);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2013-07-25 08:44:17 +08:00
|
|
|
static void bch_btree_leaf_dirty(struct btree *b, atomic_t *journal_ref)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2013-12-18 15:49:49 +08:00
|
|
|
struct bset *i = btree_bset_last(b);
|
2013-03-24 07:11:31 +08:00
|
|
|
struct btree_write *w = btree_current_write(b);
|
|
|
|
|
2014-03-05 08:42:42 +08:00
|
|
|
lockdep_assert_held(&b->write_lock);
|
|
|
|
|
2013-04-26 04:58:35 +08:00
|
|
|
BUG_ON(!b->written);
|
|
|
|
BUG_ON(!i->keys);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-04-26 04:58:35 +08:00
|
|
|
if (!btree_node_dirty(b))
|
2014-01-23 17:44:55 +08:00
|
|
|
schedule_delayed_work(&b->work, 30 * HZ);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-04-26 04:58:35 +08:00
|
|
|
set_btree_node_dirty(b);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 08:44:17 +08:00
|
|
|
if (journal_ref) {
|
2013-03-24 07:11:31 +08:00
|
|
|
if (w->journal &&
|
2013-07-25 08:44:17 +08:00
|
|
|
journal_pin_cmp(b->c, w->journal, journal_ref)) {
|
2013-03-24 07:11:31 +08:00
|
|
|
atomic_dec_bug(w->journal);
|
|
|
|
w->journal = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!w->journal) {
|
2013-07-25 08:44:17 +08:00
|
|
|
w->journal = journal_ref;
|
2013-03-24 07:11:31 +08:00
|
|
|
atomic_inc(w->journal);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Force write if set is too big */
|
2013-04-26 04:58:35 +08:00
|
|
|
if (set_bytes(i) > PAGE_SIZE - 48 &&
|
|
|
|
!current->bio_list)
|
|
|
|
bch_btree_node_write(b, NULL);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Btree in memory cache - allocation/freeing
|
|
|
|
* mca -> memory cache
|
|
|
|
*/
|
|
|
|
|
|
|
|
#define mca_reserve(c) (((c->root && c->root->level) \
|
|
|
|
? c->root->level : 1) * 8 + 16)
|
|
|
|
#define mca_can_free(c) \
|
2014-03-18 08:15:53 +08:00
|
|
|
max_t(int, 0, c->btree_cache_used - mca_reserve(c))
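The two macros above reduce to simple arithmetic: a reserve of 8 nodes per btree level plus 16, and a "can free" count that is whatever the cache holds beyond that reserve, clamped at zero. The following is only a standalone sketch of that arithmetic. The names `reserve_nodes` and `can_free_nodes` are hypothetical, and plain integers stand in for `c->root`, `c->root->level` and `c->btree_cache_used`.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for mca_reserve(): a missing root or a level-0
 * root counts as level 1, then 8 nodes per level plus 16 are reserved.
 */
static int reserve_nodes(int has_root, int root_level)
{
	int level = (has_root && root_level) ? root_level : 1;

	assert(root_level >= 0);
	return level * 8 + 16;
}

/*
 * Hypothetical stand-in for mca_can_free(): nodes beyond the reserve,
 * clamped at zero the way max_t(int, 0, ...) does.
 */
static int can_free_nodes(int cache_used, int has_root, int root_level)
{
	int spare = cache_used - reserve_nodes(has_root, root_level);

	return spare > 0 ? spare : 0;
}
```

With a level-2 root the reserve is 32 nodes, so a cache holding 100 nodes could give up 68, while a cache holding 10 nodes could give up none.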
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
static void mca_data_free(struct btree *b)
|
|
|
|
{
|
2013-12-17 07:27:25 +08:00
|
|
|
BUG_ON(b->io_mutex.count != 1);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-12-21 09:28:16 +08:00
|
|
|
bch_btree_keys_free(&b->keys);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
b->c->btree_cache_used--;
|
2013-12-18 15:49:49 +08:00
|
|
|
list_move(&b->list, &b->c->btree_cache_freed);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void mca_bucket_free(struct btree *b)
|
|
|
|
{
|
|
|
|
BUG_ON(btree_node_dirty(b));
|
|
|
|
|
|
|
|
b->key.ptr[0] = 0;
|
|
|
|
hlist_del_init_rcu(&b->hash);
|
|
|
|
list_move(&b->list, &b->c->btree_cache_freeable);
|
|
|
|
}
|
|
|
|
|
2018-08-11 13:19:44 +08:00
|
|
|
static unsigned int btree_order(struct bkey *k)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
return ilog2(KEY_SIZE(k) / PAGE_SECTORS ?: 1);
|
|
|
|
}
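As a rough illustration of what `btree_order()` computes, the sketch below converts a key size in sectors into a page allocation order. It is not the kernel code: `floor_log2` is a portable stand-in for `ilog2()`, `order_for_key` and `SKETCH_PAGE_SECTORS` are hypothetical names, and the value 8 assumes 4 KiB pages with 512-byte sectors.

```c
#include <assert.h>

#define SKETCH_PAGE_SECTORS 8u	/* assumes 4 KiB pages and 512-byte sectors */

/* Floor of log2(v) for v > 0, a portable stand-in for the kernel's ilog2(). */
static unsigned int floor_log2(unsigned int v)
{
	unsigned int r = 0;

	assert(v > 0);
	while (v >>= 1)
		r++;
	return r;
}

/*
 * Hypothetical stand-in for btree_order(): key_sectors plays the role of
 * KEY_SIZE(k). The GNU "a ?: b" shorthand is spelled out as "a ? a : b",
 * so a key smaller than one page still yields order 0.
 */
static unsigned int order_for_key(unsigned int key_sectors)
{
	unsigned int pages = key_sectors / SKETCH_PAGE_SECTORS;

	return floor_log2(pages ? pages : 1);
}
```

Under these assumptions a 1024-sector key spans 128 pages and gives order 7, and a 4-sector key falls back to a single page and gives order 0.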
|
|
|
|
|
|
|
|
static void mca_data_alloc(struct btree *b, struct bkey *k, gfp_t gfp)
|
|
|
|
{
|
2013-12-21 09:28:16 +08:00
|
|
|
if (!bch_btree_keys_alloc(&b->keys,
|
2018-08-11 13:19:44 +08:00
|
|
|
max_t(unsigned int,
|
2013-12-18 15:49:49 +08:00
|
|
|
ilog2(b->c->btree_pages),
|
|
|
|
btree_order(k)),
|
|
|
|
gfp)) {
|
2014-03-18 08:15:53 +08:00
|
|
|
b->c->btree_cache_used++;
|
2013-12-18 15:49:49 +08:00
|
|
|
list_move(&b->list, &b->c->btree_cache);
|
|
|
|
} else {
|
|
|
|
list_move(&b->list, &b->c->btree_cache_freed);
|
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static struct btree *mca_bucket_alloc(struct cache_set *c,
|
|
|
|
struct bkey *k, gfp_t gfp)
|
|
|
|
{
|
2019-06-28 19:59:34 +08:00
|
|
|
/*
|
|
|
|
* kzalloc() is necessary here for initialization,
|
|
|
|
* see code comments in bch_btree_keys_init().
|
|
|
|
*/
|
2013-03-24 07:11:31 +08:00
|
|
|
struct btree *b = kzalloc(sizeof(struct btree), gfp);
|
2018-08-11 13:19:45 +08:00
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
if (!b)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
init_rwsem(&b->lock);
|
|
|
|
lockdep_set_novalidate_class(&b->lock);
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_init(&b->write_lock);
|
|
|
|
lockdep_set_novalidate_class(&b->write_lock);
|
2013-03-24 07:11:31 +08:00
|
|
|
INIT_LIST_HEAD(&b->list);
|
2013-04-26 04:58:35 +08:00
|
|
|
INIT_DELAYED_WORK(&b->work, btree_node_write_work);
|
2013-03-24 07:11:31 +08:00
|
|
|
b->c = c;
|
2013-12-17 07:27:25 +08:00
|
|
|
sema_init(&b->io_mutex, 1);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
mca_data_alloc(b, k, gfp);
|
|
|
|
return b;
|
|
|
|
}
|
|
|
|
|
2018-08-11 13:19:44 +08:00
|
|
|
static int mca_reap(struct btree *b, unsigned int min_order, bool flush)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2013-07-25 08:27:07 +08:00
|
|
|
struct closure cl;
|
|
|
|
|
|
|
|
closure_init_stack(&cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
lockdep_assert_held(&b->c->bucket_lock);
|
|
|
|
|
|
|
|
if (!down_write_trylock(&b->lock))
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2013-12-21 09:28:16 +08:00
|
|
|
BUG_ON(btree_node_dirty(b) && !b->keys.set[0].data);
|
2013-07-25 08:27:07 +08:00
|
|
|
|
2013-12-21 09:28:16 +08:00
|
|
|
if (b->keys.page_order < min_order)
|
2013-12-17 07:27:25 +08:00
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
if (!flush) {
|
|
|
|
if (btree_node_dirty(b))
|
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
if (down_trylock(&b->io_mutex))
|
|
|
|
goto out_unlock;
|
|
|
|
up(&b->io_mutex);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
bcache: fix race in btree_flush_write()
There is a race between mca_reap(), btree_node_free() and the journal code
btree_flush_write(), which results in very rare and strange deadlocks or
panics that are very hard to reproduce.
Let me explain how the race happens. In btree_flush_write() the btree
node with the oldest journal pin is selected, then it is flushed to the
cache device; select-and-flush is a two-step operation. Between these two
steps, the following may happen inside the race window:
- The selected btree node is reaped by mca_reap() and allocated to
another requester for a different btree node.
- The selected btree node is selected, flushed and released by the mca
shrink callback bch_mca_scan().
When btree_flush_write() tries to flush the selected btree node, it first
takes b->write_lock with mutex_lock(). If the race happens and the memory
of the selected btree node has been allocated to another btree node whose
write_lock is already held, a deadlock very probably happens here. A worse
case is that the memory of the selected btree node has been released; then
all references to this btree node (e.g. b->write_lock) will trigger a NULL
pointer dereference panic.
This race was introduced in commit cafe56359144 ("bcache: A block layer
cache"), and widened by commit c4dc2497d50d ("bcache: fix high CPU
occupancy during journal"), which selects 128 btree nodes and flushes
them one by one over a fairly long time period.
This race was not easy to reproduce before. On a Lenovo SR650 server with
48 Xeon cores, with 1 NVMe SSD configured as the cache device and an MD
raid0 device assembled from 3 NVMe SSDs as the backing device, the race
can be observed about once in every 10,000 calls to btree_flush_write().
Both deadlock and kernel panic were observed in the aftermath of the race.
The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
is set when btree nodes are selected, and cleared after the btree nodes
are flushed. When mca_reap() selects a btree node with this bit set, the
btree node is skipped. Since mca_reap() only reaps btree nodes without
the BTREE_NODE_journal_flush flag, the race is avoided.
One corner case should be noted: btree_node_free(). It may be called in
some error handling code paths, for example in the following code from
btree_split():
2149 err_free2:
2150 bkey_put(b->c, &n2->key);
2151 btree_node_free(n2);
2152 rw_unlock(true, n2);
2153 err_free1:
2154 bkey_put(b->c, &n1->key);
2155 btree_node_free(n1);
2156 rw_unlock(true, n1);
At lines 2151 and 2155, the btree nodes n2 and n1 are released without
mca_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
If btree_node_free() is called directly in such an error handling path
and the selected btree node has the BTREE_NODE_journal_flush bit set, just
delay for 1 us and retry. In this case the btree node is not skipped; the
code retries until the BTREE_NODE_journal_flush bit is cleared, and then
frees the btree node memory.
Fixes: cafe56359144 ("bcache: A block layer cache")
Signed-off-by: Coly Li <colyli@suse.de>
Reported-and-tested-by: kbuild test robot <lkp@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 19:59:58 +08:00
|
|
|
retry:
|
2019-06-28 19:59:56 +08:00
|
|
|
/*
|
|
|
|
* BTREE_NODE_dirty might be cleared in btree_flush_write() by
|
|
|
|
* __bch_btree_node_write(). To avoid an extra flush, acquire
|
|
|
|
* b->write_lock before checking BTREE_NODE_dirty bit.
|
|
|
|
*/
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_lock(&b->write_lock);
|
2019-06-28 19:59:58 +08:00
|
|
|
/*
|
|
|
|
* If this btree node is selected in btree_flush_write() by journal
|
|
|
|
* code, delay and retry until the node is flushed by journal code
|
|
|
|
* and BTREE_NODE_journal_flush bit cleared by btree_flush_write().
|
|
|
|
*/
|
|
|
|
if (btree_node_journal_flush(b)) {
|
|
|
|
pr_debug("bnode %p is flushing by journal, retry", b);
|
|
|
|
mutex_unlock(&b->write_lock);
|
|
|
|
udelay(1);
|
|
|
|
goto retry;
|
|
|
|
}
|
|
|
|
|
2013-07-24 11:48:29 +08:00
|
|
|
if (btree_node_dirty(b))
|
2014-03-05 08:42:42 +08:00
|
|
|
__bch_btree_node_write(b, &cl);
|
|
|
|
mutex_unlock(&b->write_lock);
|
|
|
|
|
|
|
|
closure_sync(&cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 08:27:07 +08:00
|
|
|
/* wait for any in flight btree write */
|
2013-12-17 07:27:25 +08:00
|
|
|
down(&b->io_mutex);
|
|
|
|
up(&b->io_mutex);
|
2013-07-25 08:27:07 +08:00
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
return 0;
|
2013-12-17 07:27:25 +08:00
|
|
|
out_unlock:
|
|
|
|
rw_unlock(true, b);
|
|
|
|
return -ENOMEM;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
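The journal-flush check in `mca_reap()` follows a "test the flag, back off, retry" pattern. The sketch below shows only that pattern in portable C11, under loud assumptions: `sketch_node`, `NODE_JOURNAL_FLUSH`, `wait_journal_flush_clear`, `finish_flush` and `keep_waiting` are hypothetical names, the retry count is bounded (the original loops without a bound), and the locking of `b->write_lock` around the check is omitted.

```c
#include <assert.h>
#include <stdatomic.h>

#define NODE_JOURNAL_FLUSH (1ul << 0)	/* hypothetical flag bit */

struct sketch_node {
	atomic_ulong flags;
};

/*
 * Poll until the journal-flush bit is clear, calling relax() between
 * attempts (the original delays with udelay(1)). Returns the number of
 * retries needed, or -1 if all max_tries polls saw the bit set.
 */
static int wait_journal_flush_clear(struct sketch_node *n, int max_tries,
				    void (*relax)(struct sketch_node *))
{
	int tries;

	assert(max_tries > 0);
	for (tries = 0; tries < max_tries; tries++) {
		if (!(atomic_load(&n->flags) & NODE_JOURNAL_FLUSH))
			return tries;
		relax(n);
	}
	return -1;
}

/* Stand-in for the journal code finishing its flush and clearing the bit. */
static void finish_flush(struct sketch_node *n)
{
	atomic_fetch_and(&n->flags, ~NODE_JOURNAL_FLUSH);
}

/* Stand-in for a delay during which the flush has not completed yet. */
static void keep_waiting(struct sketch_node *n)
{
	(void)n;
}
```

Passing `finish_flush` as the `relax` callback models the flusher completing during the first delay, so the wait succeeds after one retry; passing `keep_waiting` models a flush that never completes within the bound.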
|
|
|
|
|
drivers: convert shrinkers to new count/scan API
Convert the driver shrinkers to the new API. Most changes are compile
tested only because I either don't have the hardware or it's staging
stuff.
FWIW, the md and android code is pretty good, but the rest of it makes me
want to claw my eyes out. The amount of broken code I just encountered is
mind boggling. I've added comments explaining what is broken, but I fear
that some of the code would be best dealt with by being dragged behind the
bike shed, buried in mud up to its neck and then run over repeatedly
with a blunt lawn mower.
Special mention goes to the zcache/zcache2 drivers. They can't co-exist
in the build at the same time, they are under different menu options in
menuconfig, they only show up when you've got the right set of mm
subsystem options configured and so even compile testing is an exercise in
pulling teeth. And that doesn't even take into account the horrible,
broken code...
[glommer@openvz.org: fixes for i915, android lowmem, zcache, bcache]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-28 08:18:11 +08:00
|
|
|
static unsigned long bch_mca_scan(struct shrinker *shrink,
|
|
|
|
struct shrink_control *sc)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
struct cache_set *c = container_of(shrink, struct cache_set, shrink);
|
|
|
|
struct btree *b, *t;
|
|
|
|
unsigned long i, nr = sc->nr_to_scan;
|
2013-08-28 08:18:11 +08:00
|
|
|
unsigned long freed = 0;
|
2018-03-19 08:36:22 +08:00
|
|
|
unsigned int btree_cache_used;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
if (c->shrinker_disabled)
|
2013-08-28 08:18:11 +08:00
|
|
|
return SHRINK_STOP;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
if (c->btree_cache_alloc_lock)
|
2013-08-28 08:18:11 +08:00
|
|
|
return SHRINK_STOP;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
/* Return -1 if we can't do anything right now */
|
2013-09-24 14:17:34 +08:00
|
|
|
if (sc->gfp_mask & __GFP_IO)
|
2013-03-24 07:11:31 +08:00
|
|
|
mutex_lock(&c->bucket_lock);
|
|
|
|
else if (!mutex_trylock(&c->bucket_lock))
|
|
|
|
return -1;
|
|
|
|
|
2013-06-04 04:04:56 +08:00
|
|
|
/*
|
|
|
|
* It's _really_ critical that we don't free too many btree nodes - we
|
|
|
|
* have to always leave ourselves a reserve. The reserve is how we
|
|
|
|
* guarantee that allocating memory for a new btree node can always
|
|
|
|
* succeed, so that inserting keys into the btree can always succeed and
|
|
|
|
* IO can always make forward progress:
|
|
|
|
*/
|
2013-03-24 07:11:31 +08:00
|
|
|
nr /= c->btree_pages;
|
2024-06-11 20:08:33 +08:00
|
|
|
if (nr == 0)
|
|
|
|
nr = 1;
|
2013-03-24 07:11:31 +08:00
|
|
|
nr = min_t(unsigned long, nr, mca_can_free(c));
|
|
|
|
|
|
|
|
i = 0;
|
2018-03-19 08:36:22 +08:00
|
|
|
btree_cache_used = c->btree_cache_used;
|
2013-03-24 07:11:31 +08:00
|
|
|
list_for_each_entry_safe(b, t, &c->btree_cache_freeable, list) {
|
2018-03-19 08:36:22 +08:00
|
|
|
if (nr <= 0)
|
|
|
|
goto out;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
if (++i > 3 &&
|
2013-07-25 08:27:07 +08:00
|
|
|
!mca_reap(b, 0, false)) {
|
2013-03-24 07:11:31 +08:00
|
|
|
mca_data_free(b);
|
|
|
|
rw_unlock(true, b);
|
2013-08-28 08:18:11 +08:00
|
|
|
freed++;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
2018-03-19 08:36:22 +08:00
|
|
|
nr--;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2018-03-19 08:36:22 +08:00
|
|
|
for (; (nr--) && i < btree_cache_used; i++) {
|
2013-12-11 05:24:26 +08:00
|
|
|
if (list_empty(&c->btree_cache))
|
|
|
|
goto out;
|
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
b = list_first_entry(&c->btree_cache, struct btree, list);
|
|
|
|
list_rotate_left(&c->btree_cache);
|
|
|
|
|
|
|
|
if (!b->accessed &&
|
2013-07-25 08:27:07 +08:00
|
|
|
!mca_reap(b, 0, false)) {
|
2013-03-24 07:11:31 +08:00
|
|
|
mca_bucket_free(b);
|
|
|
|
mca_data_free(b);
|
|
|
|
rw_unlock(true, b);
|
2013-08-28 08:18:11 +08:00
|
|
|
freed++;
|
2013-03-24 07:11:31 +08:00
|
|
|
} else
|
|
|
|
b->accessed = 0;
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
mutex_unlock(&c->bucket_lock);
|
2018-03-19 08:36:21 +08:00
|
|
|
return freed * c->btree_pages;
|
2013-08-28 08:18:11 +08:00
|
|
|
}
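The scan loop above gives each cached node a second chance: a node whose `accessed` flag is set has the flag cleared and survives this pass, while a node that is already unaccessed is freed. The following is a minimal userspace sketch of that policy only. It assumes a fixed array with a moving hand in place of the kernel's rotated list, and `toy_node`, `toy_scan` and the `in_use` field are hypothetical names, not bcache API.

```c
#include <stddef.h>

/* Hypothetical stand-in for a cached btree node. */
struct toy_node {
	int accessed;	/* set on use, cleared by the scanner */
	int in_use;	/* 1 while the node still holds data */
};

/*
 * Visit at most nr nodes starting at *hand, wrapping around the array.
 * Unaccessed nodes are freed; accessed nodes lose their flag and get
 * one more pass before they become eligible. Returns the number freed.
 */
static size_t toy_scan(struct toy_node *nodes, size_t n, size_t *hand,
		       size_t nr)
{
	size_t freed = 0;

	for (size_t i = 0; i < n && nr > 0; i++, nr--) {
		struct toy_node *b = &nodes[*hand];

		*hand = (*hand + 1) % n;
		if (!b->in_use)
			continue;
		if (!b->accessed) {
			b->in_use = 0;	/* stands in for freeing the data */
			freed++;
		} else {
			b->accessed = 0;
		}
	}
	return freed;
}
```

A node that is touched between two scans keeps surviving, which is what setting `b->accessed = 1` on lookup achieves in the real code.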
|
|
|
|
|
|
|
|
static unsigned long bch_mca_count(struct shrinker *shrink,
|
|
|
|
struct shrink_control *sc)
|
|
|
|
{
|
|
|
|
struct cache_set *c = container_of(shrink, struct cache_set, shrink);
|
|
|
|
|
|
|
|
if (c->shrinker_disabled)
|
|
|
|
return 0;
|
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
if (c->btree_cache_alloc_lock)
|
2013-08-28 08:18:11 +08:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
return mca_can_free(c) * c->btree_pages;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
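The count/scan split used by `bch_mca_count()` and `bch_mca_scan()` can be illustrated outside the kernel. The sketch below is an assumption-laden userspace model: `toy_cache`, `toy_shrink_control`, `toy_count` and `toy_scan` are hypothetical names, and the locking and `SHRINK_STOP` handling of the real shrinker API are left out. It only shows that both callbacks report in pages while the cache frees whole nodes.

```c
/* Hypothetical mirror of the one shrink_control field used here. */
struct toy_shrink_control {
	unsigned long nr_to_scan;	/* pages the caller wants scanned */
};

struct toy_cache {
	int shrinker_disabled;
	unsigned long nr_freeable;	/* nodes that may be reclaimed */
	unsigned long pages_per_node;	/* plays the role of c->btree_pages */
};

/* Count callback: how many pages could be freed right now? */
static unsigned long toy_count(const struct toy_cache *c)
{
	if (c->shrinker_disabled)
		return 0;
	return c->nr_freeable * c->pages_per_node;
}

/* Scan callback: free up to nr_to_scan pages, rounded down to nodes. */
static unsigned long toy_scan(struct toy_cache *c,
			      const struct toy_shrink_control *sc)
{
	unsigned long nr, freed = 0;

	if (c->shrinker_disabled)
		return 0;

	nr = sc->nr_to_scan / c->pages_per_node;
	for (; nr > 0 && c->nr_freeable > 0; nr--) {
		c->nr_freeable--;
		freed++;
	}
	return freed * c->pages_per_node;
}
```

Because the request is divided by the node size, a request smaller than one node frees nothing in this model.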
|
|
|
|
|
|
|
|
void bch_btree_cache_free(struct cache_set *c)
|
|
|
|
{
|
|
|
|
struct btree *b;
|
|
|
|
struct closure cl;
|
2018-08-11 13:19:45 +08:00
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
closure_init_stack(&cl);
|
|
|
|
|
|
|
|
if (c->shrink.list.next)
|
|
|
|
unregister_shrinker(&c->shrink);
|
|
|
|
|
|
|
|
mutex_lock(&c->bucket_lock);
|
|
|
|
|
|
|
|
#ifdef CONFIG_BCACHE_DEBUG
|
|
|
|
if (c->verify_data)
|
|
|
|
list_move(&c->verify_data->list, &c->btree_cache);
|
2013-12-18 14:49:08 +08:00
|
|
|
|
|
|
|
free_pages((unsigned long) c->verify_ondisk, ilog2(bucket_pages(c)));
|
2013-03-24 07:11:31 +08:00
|
|
|
#endif
|
|
|
|
|
|
|
|
list_splice(&c->btree_cache_freeable,
|
|
|
|
&c->btree_cache);
|
|
|
|
|
|
|
|
while (!list_empty(&c->btree_cache)) {
|
|
|
|
b = list_first_entry(&c->btree_cache, struct btree, list);
|
|
|
|
|
2019-06-28 19:59:56 +08:00
|
|
|
/*
|
|
|
|
* This function is called by cache_set_free(), when no I/O
|
|
|
|
* requests are in flight on the cache, so it is no longer necessary
|
|
|
|
* to acquire b->write_lock before clearing BTREE_NODE_dirty.
|
|
|
|
*/
|
2019-06-28 19:59:55 +08:00
|
|
|
if (btree_node_dirty(b)) {
|
2013-03-24 07:11:31 +08:00
|
|
|
btree_complete_write(b, btree_current_write(b));
|
2019-06-28 19:59:55 +08:00
|
|
|
clear_bit(BTREE_NODE_dirty, &b->flags);
|
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
mca_data_free(b);
|
|
|
|
}
|
|
|
|
|
|
|
|
while (!list_empty(&c->btree_cache_freed)) {
|
|
|
|
b = list_first_entry(&c->btree_cache_freed,
|
|
|
|
struct btree, list);
|
|
|
|
list_del(&b->list);
|
|
|
|
cancel_delayed_work_sync(&b->work);
|
|
|
|
kfree(b);
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_unlock(&c->bucket_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
int bch_btree_cache_alloc(struct cache_set *c)
|
|
|
|
{
|
2018-08-11 13:19:44 +08:00
|
|
|
unsigned int i;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
for (i = 0; i < mca_reserve(c); i++)
|
2013-10-25 08:19:26 +08:00
|
|
|
if (!mca_bucket_alloc(c, &ZERO_KEY, GFP_KERNEL))
|
|
|
|
return -ENOMEM;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
list_splice_init(&c->btree_cache,
|
|
|
|
&c->btree_cache_freeable);
|
|
|
|
|
|
|
|
#ifdef CONFIG_BCACHE_DEBUG
|
|
|
|
mutex_init(&c->verify_lock);
|
|
|
|
|
2013-12-18 14:49:08 +08:00
|
|
|
c->verify_ondisk = (void *)
|
|
|
|
__get_free_pages(GFP_KERNEL, ilog2(bucket_pages(c)));
|
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
c->verify_data = mca_bucket_alloc(c, &ZERO_KEY, GFP_KERNEL);
|
|
|
|
|
|
|
|
if (c->verify_data &&
|
2013-12-21 09:28:16 +08:00
|
|
|
c->verify_data->keys.set->data)
|
2013-03-24 07:11:31 +08:00
|
|
|
list_del_init(&c->verify_data->list);
|
|
|
|
else
|
|
|
|
c->verify_data = NULL;
|
|
|
|
#endif
|
|
|
|
|
2013-08-28 08:18:11 +08:00
|
|
|
c->shrink.count_objects = bch_mca_count;
|
|
|
|
c->shrink.scan_objects = bch_mca_scan;
|
2013-03-24 07:11:31 +08:00
|
|
|
c->shrink.seeks = 4;
|
|
|
|
c->shrink.batch = c->btree_pages * 2;
|
2017-11-25 07:14:27 +08:00
|
|
|
|
|
|
|
if (register_shrinker(&c->shrink))
|
|
|
|
pr_warn("bcache: %s: could not register shrinker",
|
|
|
|
__func__);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Btree in memory cache - hash table */
|
|
|
|
|
|
|
|
static struct hlist_head *mca_hash(struct cache_set *c, struct bkey *k)
|
|
|
|
{
|
|
|
|
return &c->bucket_hash[hash_32(PTR_HASH(c, k), BUCKET_HASH_BITS)];
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct btree *mca_find(struct cache_set *c, struct bkey *k)
|
|
|
|
{
|
|
|
|
struct btree *b;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
hlist_for_each_entry_rcu(b, mca_hash(c, k), hash)
|
|
|
|
if (PTR_HASH(c, &b->key) == PTR_HASH(c, k))
|
|
|
|
goto out;
|
|
|
|
b = NULL;
|
|
|
|
out:
|
|
|
|
rcu_read_unlock();
|
|
|
|
return b;
|
|
|
|
}
|
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
static int mca_cannibalize_lock(struct cache_set *c, struct btree_op *op)
|
|
|
|
{
|
|
|
|
struct task_struct *old;
|
|
|
|
|
|
|
|
old = cmpxchg(&c->btree_cache_alloc_lock, NULL, current);
|
|
|
|
if (old && old != current) {
|
|
|
|
if (op)
|
|
|
|
prepare_to_wait(&c->btree_cache_wait, &op->wait,
|
|
|
|
TASK_UNINTERRUPTIBLE);
|
|
|
|
return -EINTR;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct btree *mca_cannibalize(struct cache_set *c, struct btree_op *op,
|
|
|
|
struct bkey *k)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2013-07-25 08:27:07 +08:00
|
|
|
struct btree *b;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-04-27 06:39:55 +08:00
|
|
|
trace_bcache_btree_cache_cannibalize(c);
|
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
if (mca_cannibalize_lock(c, op))
|
|
|
|
return ERR_PTR(-EINTR);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 08:27:07 +08:00
|
|
|
list_for_each_entry_reverse(b, &c->btree_cache, list)
|
|
|
|
if (!mca_reap(b, btree_order(k), false))
|
|
|
|
return b;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 08:27:07 +08:00
|
|
|
list_for_each_entry_reverse(b, &c->btree_cache, list)
|
|
|
|
if (!mca_reap(b, btree_order(k), true))
|
|
|
|
return b;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
WARN(1, "btree cache cannibalize failed\n");
|
2013-07-25 08:27:07 +08:00
|
|
|
return ERR_PTR(-ENOMEM);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We can only have one thread cannibalizing other cached btree nodes at a time,
|
|
|
|
* or we'll deadlock. We use an open-coded mutex to ensure that, which
|
|
|
|
* mca_cannibalize_lock() takes. This means every time we unlock the root of
|
|
|
|
* the btree, we need to release this lock if we have it held.
|
|
|
|
*/
|
2013-07-25 08:37:59 +08:00
|
|
|
static void bch_cannibalize_unlock(struct cache_set *c)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2014-03-18 08:15:53 +08:00
|
|
|
if (c->btree_cache_alloc_lock == current) {
|
|
|
|
c->btree_cache_alloc_lock = NULL;
|
|
|
|
wake_up(&c->btree_cache_wait);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
}
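The open-coded mutex formed by `mca_cannibalize_lock()` and `bch_cannibalize_unlock()` is an owner pointer that is claimed with a compare-and-exchange and released only by its owner. A rough userspace sketch using C11 atomics follows. `toy_owner_lock`, `toy_lock` and `toy_unlock` are hypothetical names, an arbitrary pointer stands in for `current`, and the wait-queue side (`prepare_to_wait()` / `wake_up()`) is omitted.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Owner-tracking lock: NULL means free, otherwise it names the holder. */
struct toy_owner_lock {
	_Atomic(void *) owner;
};

/*
 * Try to take the lock for 'self'. Succeeds if the lock was free or is
 * already held by 'self'; otherwise returns -EINTR so the caller can
 * back off and retry later, as the kernel function does.
 */
static int toy_lock(struct toy_owner_lock *l, void *self)
{
	void *expected = NULL;

	if (atomic_compare_exchange_strong(&l->owner, &expected, self))
		return 0;
	/* On failure 'expected' holds the current owner. */
	return expected == self ? 0 : -EINTR;
}

/* Release the lock, but only if 'self' is the current holder. */
static void toy_unlock(struct toy_owner_lock *l, void *self)
{
	void *expected = self;

	atomic_compare_exchange_strong(&l->owner, &expected, NULL);
}
```

Storing the owner rather than a plain flag is what lets the same task call the lock function repeatedly without deadlocking on itself, and lets unlock be a no-op for tasks that never held it.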
|
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
static struct btree *mca_alloc(struct cache_set *c, struct btree_op *op,
|
|
|
|
struct bkey *k, int level)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
struct btree *b;
|
|
|
|
|
2013-07-25 08:27:07 +08:00
|
|
|
BUG_ON(current->bio_list);
|
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
lockdep_assert_held(&c->bucket_lock);
|
|
|
|
|
|
|
|
if (mca_find(c, k))
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
/* btree_free() doesn't free memory; it sticks the node on the end of
|
|
|
|
* the list. Check if there are any freed nodes there:
|
|
|
|
*/
|
|
|
|
list_for_each_entry(b, &c->btree_cache_freeable, list)
|
2013-07-25 08:27:07 +08:00
|
|
|
if (!mca_reap(b, btree_order(k), false))
|
2013-03-24 07:11:31 +08:00
|
|
|
goto out;
|
|
|
|
|
|
|
|
/* We never free struct btree itself, just the memory that holds the on
|
|
|
|
* disk node. Check the freed list before allocating a new one:
|
|
|
|
*/
|
|
|
|
list_for_each_entry(b, &c->btree_cache_freed, list)
|
2013-07-25 08:27:07 +08:00
|
|
|
if (!mca_reap(b, 0, false)) {
|
2013-03-24 07:11:31 +08:00
|
|
|
mca_data_alloc(b, k, __GFP_NOWARN|GFP_NOIO);
|
2013-12-21 09:28:16 +08:00
|
|
|
if (!b->keys.set[0].data)
|
2013-03-24 07:11:31 +08:00
|
|
|
goto err;
|
|
|
|
else
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
b = mca_bucket_alloc(c, k, __GFP_NOWARN|GFP_NOIO);
|
|
|
|
if (!b)
|
|
|
|
goto err;
|
|
|
|
|
|
|
|
BUG_ON(!down_write_trylock(&b->lock));
|
2013-12-21 09:28:16 +08:00
|
|
|
if (!b->keys.set->data)
|
2013-03-24 07:11:31 +08:00
|
|
|
goto err;
|
|
|
|
out:
|
2013-12-17 07:27:25 +08:00
|
|
|
BUG_ON(b->io_mutex.count != 1);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
bkey_copy(&b->key, k);
|
|
|
|
list_move(&b->list, &c->btree_cache);
|
|
|
|
hlist_del_init_rcu(&b->hash);
|
|
|
|
hlist_add_head_rcu(&b->hash, mca_hash(c, k));
|
|
|
|
|
|
|
|
lock_set_subclass(&b->lock.dep_map, level + 1, _THIS_IP_);
|
2013-07-25 08:20:19 +08:00
|
|
|
b->parent = (void *) ~0UL;
|
2013-12-21 09:28:16 +08:00
|
|
|
b->flags = 0;
|
|
|
|
b->written = 0;
|
|
|
|
b->level = level;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-12-21 09:22:05 +08:00
|
|
|
if (!b->level)
|
2013-12-21 09:28:16 +08:00
|
|
|
bch_btree_keys_init(&b->keys, &bch_extent_keys_ops,
|
|
|
|
&b->c->expensive_debug_checks);
|
2013-12-21 09:22:05 +08:00
|
|
|
else
|
2013-12-21 09:28:16 +08:00
|
|
|
bch_btree_keys_init(&b->keys, &bch_btree_keys_ops,
|
|
|
|
&b->c->expensive_debug_checks);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
return b;
|
|
|
|
err:
|
|
|
|
if (b)
|
|
|
|
rw_unlock(true, b);
|
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
b = mca_cannibalize(c, op, k);
|
2013-03-24 07:11:31 +08:00
|
|
|
if (!IS_ERR(b))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
return b;
|
|
|
|
}
|
|
|
|
|
2018-03-19 08:36:29 +08:00
|
|
|
/*
|
2013-03-24 07:11:31 +08:00
|
|
|
* bch_btree_node_get - find a btree node in the cache and lock it, reading it
|
|
|
|
* in from disk if necessary.
|
|
|
|
*
|
2013-07-25 09:04:18 +08:00
|
|
|
* If IO is necessary and running under generic_make_request, returns -EAGAIN.
|
2013-03-24 07:11:31 +08:00
|
|
|
*
|
|
|
|
* The btree node will have either a read or a write lock held, depending on
|
|
|
|
* level and op->lock.
|
|
|
|
*/
|
2014-03-18 08:15:53 +08:00
|
|
|
struct btree *bch_btree_node_get(struct cache_set *c, struct btree_op *op,
|
2014-07-12 15:22:53 +08:00
|
|
|
struct bkey *k, int level, bool write,
|
|
|
|
struct btree *parent)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
int i = 0;
|
|
|
|
struct btree *b;
|
|
|
|
|
|
|
|
BUG_ON(level < 0);
|
|
|
|
retry:
|
|
|
|
b = mca_find(c, k);
|
|
|
|
|
|
|
|
if (!b) {
|
2013-04-26 04:58:35 +08:00
|
|
|
if (current->bio_list)
|
|
|
|
return ERR_PTR(-EAGAIN);
|
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
mutex_lock(&c->bucket_lock);
|
2014-03-18 08:15:53 +08:00
|
|
|
b = mca_alloc(c, op, k, level);
|
2013-03-24 07:11:31 +08:00
|
|
|
mutex_unlock(&c->bucket_lock);
|
|
|
|
|
|
|
|
if (!b)
|
|
|
|
goto retry;
|
|
|
|
if (IS_ERR(b))
|
|
|
|
return b;
|
|
|
|
|
2013-04-26 04:58:35 +08:00
|
|
|
bch_btree_node_read(b);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
if (!write)
|
|
|
|
downgrade_write(&b->lock);
|
|
|
|
} else {
|
|
|
|
rw_lock(write, b, level);
|
|
|
|
if (PTR_HASH(c, &b->key) != PTR_HASH(c, k)) {
|
|
|
|
rw_unlock(write, b);
|
|
|
|
goto retry;
|
|
|
|
}
|
|
|
|
BUG_ON(b->level != level);
|
|
|
|
}
|
|
|
|
|
2018-08-09 15:48:44 +08:00
|
|
|
if (btree_node_io_error(b)) {
|
|
|
|
rw_unlock(write, b);
|
|
|
|
return ERR_PTR(-EIO);
|
|
|
|
}
|
|
|
|
|
|
|
|
BUG_ON(!b->written);
|
|
|
|
|
2014-07-12 15:22:53 +08:00
|
|
|
b->parent = parent;
|
2013-03-24 07:11:31 +08:00
|
|
|
b->accessed = 1;
|
|
|
|
|
2013-12-21 09:28:16 +08:00
|
|
|
for (; i <= b->keys.nsets && b->keys.set[i].size; i++) {
|
|
|
|
prefetch(b->keys.set[i].tree);
|
|
|
|
prefetch(b->keys.set[i].data);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2013-12-21 09:28:16 +08:00
|
|
|
for (; i <= b->keys.nsets; i++)
|
|
|
|
prefetch(b->keys.set[i].data);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
return b;
|
|
|
|
}
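`bch_btree_node_get()` reports failures such as `-EAGAIN` and `-EIO` through the returned pointer itself, using the kernel's `ERR_PTR()` / `IS_ERR()` helpers. The sketch below re-creates that encoding in userspace under hypothetical `toy_` names. It assumes, as the kernel does, that the top 4095 addresses are never valid object pointers.

```c
#include <errno.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *toy_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Recover the errno value from an error pointer. */
static inline long toy_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True if p lies in the last TOY_MAX_ERRNO addresses, i.e. is an error. */
static inline int toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}
```

With this encoding a caller can tell three outcomes apart from one return value: a usable pointer, `NULL` (which `bch_btree_node_get()` treats as "retry" when it comes back from `mca_alloc()`), and an error pointer carrying the reason.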
|
|
|
|
|
2014-07-12 15:22:53 +08:00
|
|
|
static void btree_node_prefetch(struct btree *parent, struct bkey *k)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
struct btree *b;
|
|
|
|
|
2014-07-12 15:22:53 +08:00
|
|
|
mutex_lock(&parent->c->bucket_lock);
|
|
|
|
b = mca_alloc(parent->c, NULL, k, parent->level - 1);
|
|
|
|
mutex_unlock(&parent->c->bucket_lock);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
if (!IS_ERR_OR_NULL(b)) {
|
2014-07-12 15:22:53 +08:00
|
|
|
b->parent = parent;
|
2013-04-26 04:58:35 +08:00
|
|
|
bch_btree_node_read(b);
|
2013-03-24 07:11:31 +08:00
|
|
|
rw_unlock(true, b);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Btree alloc */
|
|
|
|
|
2013-07-25 08:27:07 +08:00
|
|
|
static void btree_node_free(struct btree *b)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2013-04-27 06:39:55 +08:00
|
|
|
trace_bcache_btree_node_free(b);
|
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
BUG_ON(b == b->c->root);
|
|
|
|
|
bcache: fix race in btree_flush_write()
There is a race between mca_reap(), btree_node_free() and the journal code
in btree_flush_write(), which results in very rare and strange deadlocks or
panics that are very hard to reproduce.
Let me explain how the race happens. In btree_flush_write() the btree
node with the oldest journal pin is selected, then it is flushed to the
cache device; select-and-flush is a two-step operation. Between these two
steps, the following may happen inside the race window:
- The selected btree node is reaped by mca_reap() and allocated to
another requester for another btree node.
- The selected btree node is selected, flushed and released by the mca
shrink callback bch_mca_scan().
When btree_flush_write() tries to flush the selected btree node, it first
takes b->write_lock with mutex_lock(). If the race happens and the
memory of the selected btree node has been allocated to another btree node
whose write_lock is already held, a deadlock very probably happens here.
A worse case is that the memory of the selected btree node has been
released; then any reference to this btree node (e.g. b->write_lock)
will trigger a NULL pointer dereference panic.
This race was introduced in commit cafe56359144 ("bcache: A block layer
cache"), and enlarged by commit c4dc2497d50d ("bcache: fix high CPU
occupancy during journal"), which selected 128 btree nodes and flushed
them one by one over quite a long time period.
This race was not easy to reproduce before. On a Lenovo SR650 server with
48 Xeon cores, configured with 1 NVMe SSD as the cache device and an MD
raid0 device assembled from 3 NVMe SSDs as the backing device, this race
can be observed roughly once every 10,000 calls to btree_flush_write().
Both deadlock and kernel panic were seen as the aftermath of the race.
The idea of the fix is to add a btree flag, BTREE_NODE_journal_flush. It
is set when selecting btree nodes, and cleared after the btree nodes are
flushed. When mca_reap() selects a btree node with this bit set,
the btree node is skipped. Since mca_reap() only reaps btree nodes
without the BTREE_NODE_journal_flush flag, the race is avoided.
One corner case should be noted: btree_node_free(). It might
be called in some error handling code paths. For example, the following
code piece from btree_split():
2149 err_free2:
2150 bkey_put(b->c, &n2->key);
2151 btree_node_free(n2);
2152 rw_unlock(true, n2);
2153 err_free1:
2154 bkey_put(b->c, &n1->key);
2155 btree_node_free(n1);
2156 rw_unlock(true, n1);
At lines 2151 and 2155, the btree nodes n2 and n1 are released without
mca_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
If btree_node_free() is called directly in such an error handling path
and the selected btree node has the BTREE_NODE_journal_flush bit set, just
delay for 1 us and retry. In this case the btree node is not
skipped; the code retries until the BTREE_NODE_journal_flush bit is cleared,
and then frees the btree node memory.
Fixes: cafe56359144 ("bcache: A block layer cache")
Signed-off-by: Coly Li <colyli@suse.de>
Reported-and-tested-by: kbuild test robot <lkp@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 19:59:58 +08:00
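The fix described in the commit message above amounts to a per-node flag that the flusher sets while it owns a node, that the reaper treats as "skip", and that the free path waits on. Below is a simplified userspace sketch of that protocol with C11 atomics. All `toy_` names are hypothetical, the `-EBUSY` return value is an assumption made for illustration, and the real code additionally holds `b->write_lock` and sleeps with `udelay(1)` between retries.

```c
#include <errno.h>
#include <stdatomic.h>

#define TOY_NODE_DIRTY		(1u << 0)
#define TOY_NODE_JOURNAL_FLUSH	(1u << 1)

struct toy_node {
	atomic_uint flags;
};

/* Flusher: mark the node as selected before starting the flush. */
static void toy_flush_select(struct toy_node *n)
{
	atomic_fetch_or(&n->flags, TOY_NODE_JOURNAL_FLUSH);
}

/* Flusher: the flush has finished, release the node. */
static void toy_flush_done(struct toy_node *n)
{
	atomic_fetch_and(&n->flags, ~TOY_NODE_JOURNAL_FLUSH);
}

/* Reaper: skip any node that the flusher has selected. */
static int toy_reap(struct toy_node *n)
{
	if (atomic_load(&n->flags) & TOY_NODE_JOURNAL_FLUSH)
		return -EBUSY;
	return 0;
}

/* Free path: cannot skip, so retry until the flusher lets go. */
static void toy_node_free(struct toy_node *n)
{
	while (atomic_load(&n->flags) & TOY_NODE_JOURNAL_FLUSH)
		;	/* the kernel delays 1 us here before retrying */
	atomic_fetch_and(&n->flags, ~TOY_NODE_DIRTY);
}
```

The two consumers differ on purpose: the reaper can pick another node, so it simply skips, whereas the free path must release this particular node and therefore waits.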
|
|
|
retry:
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_lock(&b->write_lock);
|
2019-06-28 19:59:58 +08:00
|
|
|
/*
|
|
|
|
* If the btree node has been selected and is being flushed by
|
|
|
|
* btree_flush_write(), delay and retry until the BTREE_NODE_journal_flush
|
|
|
|
* bit is cleared; only then is it safe to free the btree node here.
|
|
|
|
* Otherwise freeing this btree node would race with the flush.
|
|
|
|
*/
|
|
|
|
if (btree_node_journal_flush(b)) {
|
|
|
|
mutex_unlock(&b->write_lock);
|
|
|
|
pr_debug("bnode %p journal_flush set, retry", b);
|
|
|
|
udelay(1);
|
|
|
|
goto retry;
|
|
|
|
}
|
2014-03-05 08:42:42 +08:00
|
|
|
|
2019-06-28 19:59:55 +08:00
|
|
|
if (btree_node_dirty(b)) {
|
2013-03-24 07:11:31 +08:00
|
|
|
btree_complete_write(b, btree_current_write(b));
|
2019-06-28 19:59:55 +08:00
|
|
|
clear_bit(BTREE_NODE_dirty, &b->flags);
|
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_unlock(&b->write_lock);
|
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
cancel_delayed_work(&b->work);
|
|
|
|
|
|
|
|
mutex_lock(&b->c->bucket_lock);
|
|
|
|
bch_bucket_free(b->c, &b->key);
|
|
|
|
mca_bucket_free(b);
|
|
|
|
mutex_unlock(&b->c->bucket_lock);
|
|
|
|
}
|
|
|
|
|
2014-04-22 09:23:12 +08:00
|
|
|
struct btree *__bch_btree_node_alloc(struct cache_set *c, struct btree_op *op,
|
2014-07-12 15:22:53 +08:00
|
|
|
int level, bool wait,
|
|
|
|
struct btree *parent)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
BKEY_PADDED(key) k;
|
|
|
|
struct btree *b = ERR_PTR(-EAGAIN);
|
|
|
|
|
|
|
|
mutex_lock(&c->bucket_lock);
|
|
|
|
retry:
|
2014-04-22 09:23:12 +08:00
|
|
|
if (__bch_bucket_alloc_set(c, RESERVE_BTREE, &k.key, 1, wait))
|
2013-03-24 07:11:31 +08:00
|
|
|
goto err;
|
|
|
|
|
2013-07-25 07:46:42 +08:00
|
|
|
bkey_put(c, &k.key);
|
2013-03-24 07:11:31 +08:00
|
|
|
SET_KEY_SIZE(&k.key, c->btree_pages * PAGE_SECTORS);
|
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
b = mca_alloc(c, op, &k.key, level);
|
2013-03-24 07:11:31 +08:00
|
|
|
if (IS_ERR(b))
|
|
|
|
goto err_free;
|
|
|
|
|
|
|
|
if (!b) {
|
2013-03-26 02:46:44 +08:00
|
|
|
cache_bug(c,
|
|
|
|
"Tried to allocate bucket that was in btree cache");
|
2013-03-24 07:11:31 +08:00
|
|
|
goto retry;
|
|
|
|
}
|
|
|
|
|
|
|
|
b->accessed = 1;
|
2014-07-12 15:22:53 +08:00
|
|
|
b->parent = parent;
|
2013-12-21 09:28:16 +08:00
|
|
|
bch_bset_init_next(&b->keys, b->keys.set->data, bset_magic(&b->c->sb));
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
mutex_unlock(&c->bucket_lock);
|
2013-04-27 06:39:55 +08:00
|
|
|
|
|
|
|
trace_bcache_btree_node_alloc(b);
|
2013-03-24 07:11:31 +08:00
|
|
|
return b;
|
|
|
|
err_free:
|
|
|
|
bch_bucket_free(c, &k.key);
|
|
|
|
err:
|
|
|
|
mutex_unlock(&c->bucket_lock);
|
2013-04-27 06:39:55 +08:00
|
|
|
|
2014-05-24 02:18:35 +08:00
|
|
|
trace_bcache_btree_node_alloc_fail(c);
|
2013-03-24 07:11:31 +08:00
|
|
|
return b;
|
|
|
|
}
|
|
|
|
|
2014-04-22 09:23:12 +08:00
|
|
|
static struct btree *bch_btree_node_alloc(struct cache_set *c,
|
2014-07-12 15:22:53 +08:00
|
|
|
struct btree_op *op, int level,
|
|
|
|
struct btree *parent)
|
2014-04-22 09:23:12 +08:00
|
|
|
{
|
2014-07-12 15:22:53 +08:00
|
|
|
return __bch_btree_node_alloc(c, op, level, op != NULL, parent);
|
2014-04-22 09:23:12 +08:00
|
|
|
}
|
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
static struct btree *btree_node_alloc_replacement(struct btree *b,
|
|
|
|
struct btree_op *op)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2014-07-12 15:22:53 +08:00
|
|
|
struct btree *n = bch_btree_node_alloc(b->c, op, b->level, b->parent);
|
2018-08-11 13:19:45 +08:00
|
|
|
|
2013-09-11 13:53:34 +08:00
|
|
|
if (!IS_ERR_OR_NULL(n)) {
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_lock(&n->write_lock);
|
2013-11-12 10:38:51 +08:00
|
|
|
bch_btree_sort_into(&b->keys, &n->keys, &b->c->sort);
|
2013-09-11 13:53:34 +08:00
|
|
|
bkey_copy_key(&n->key, &b->key);
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_unlock(&n->write_lock);
|
2013-09-11 13:53:34 +08:00
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
return n;
|
|
|
|
}
|
|
|
|
|
2013-07-25 14:18:05 +08:00
|
|
|
static void make_btree_freeing_key(struct btree *b, struct bkey *k)
|
|
|
|
{
|
2018-08-11 13:19:44 +08:00
|
|
|
unsigned int i;
|
2013-07-25 14:18:05 +08:00
|
|
|
|
2014-03-18 09:22:34 +08:00
|
|
|
mutex_lock(&b->c->bucket_lock);
|
|
|
|
|
|
|
|
atomic_inc(&b->c->prio_blocked);
|
|
|
|
|
2013-07-25 14:18:05 +08:00
|
|
|
bkey_copy(k, &b->key);
|
|
|
|
bkey_copy_key(k, &ZERO_KEY);
|
|
|
|
|
2014-03-18 09:22:34 +08:00
|
|
|
for (i = 0; i < KEY_PTRS(k); i++)
|
|
|
|
SET_PTR_GEN(k, i,
|
|
|
|
bch_inc_gen(PTR_CACHE(b->c, &b->key, i),
|
|
|
|
PTR_BUCKET(b->c, &b->key, i)));
|
2013-07-25 14:18:05 +08:00
|
|
|
|
2014-03-18 09:22:34 +08:00
|
|
|
mutex_unlock(&b->c->bucket_lock);
|
2013-07-25 14:18:05 +08:00
|
|
|
}
|
|
|
|
|
2013-12-17 17:29:34 +08:00
|
|
|
static int btree_check_reserve(struct btree *b, struct btree_op *op)
|
|
|
|
{
|
|
|
|
struct cache_set *c = b->c;
|
|
|
|
struct cache *ca;
|
2018-08-11 13:19:44 +08:00
|
|
|
unsigned int i, reserve = (c->root->level - b->level) * 2 + 1;
|
2013-12-17 17:29:34 +08:00
|
|
|
|
|
|
|
mutex_lock(&c->bucket_lock);
|
|
|
|
|
|
|
|
for_each_cache(ca, c, i)
|
|
|
|
if (fifo_used(&ca->free[RESERVE_BTREE]) < reserve) {
|
|
|
|
if (op)
|
2014-03-18 08:15:53 +08:00
|
|
|
prepare_to_wait(&c->btree_cache_wait, &op->wait,
|
2013-12-17 17:29:34 +08:00
|
|
|
TASK_UNINTERRUPTIBLE);
|
2014-03-18 08:15:53 +08:00
|
|
|
mutex_unlock(&c->bucket_lock);
|
|
|
|
return -EINTR;
|
2013-12-17 17:29:34 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
mutex_unlock(&c->bucket_lock);
|
2014-03-18 08:15:53 +08:00
|
|
|
|
|
|
|
return mca_cannibalize_lock(b->c, op);
|
2013-12-17 17:29:34 +08:00
|
|
|
}
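The reserve arithmetic in btree_check_reserve() can be tried in isolation. The following is a minimal user-space sketch, not the kernel code: the function names are hypothetical, and the plain integer free_buckets stands in for fifo_used(&ca->free[RESERVE_BTREE]). It only shows that the required reserve grows by two for each level between the node and the root, plus one.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: same formula as the `reserve` initializer above. */
static unsigned int sketch_btree_reserve(int root_level, int node_level)
{
	assert(node_level >= 0 && root_level >= node_level);
	return (unsigned int)(root_level - node_level) * 2 + 1;
}

/*
 * Hypothetical helper: mirrors the "fifo_used(...) < reserve" test, which
 * makes btree_check_reserve() return -EINTR when the reserve is too small.
 */
static bool sketch_reserve_ok(unsigned int free_buckets, unsigned int reserve)
{
	return free_buckets >= reserve;
}
```

Under these assumptions, a node at the root level needs one free bucket, while a leaf under a root at level 2 needs five.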
|
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
/* Garbage collection */
|
|
|
|
|
2014-03-18 06:13:26 +08:00
|
|
|
static uint8_t __bch_btree_mark_key(struct cache_set *c, int level,
|
|
|
|
struct bkey *k)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
uint8_t stale = 0;
|
2018-08-11 13:19:44 +08:00
|
|
|
unsigned int i;
|
2013-03-24 07:11:31 +08:00
|
|
|
struct bucket *g;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ptr_invalid() can't return true for the keys that mark btree nodes as
|
|
|
|
* freed, but since ptr_bad() returns true we'll never actually use them
|
|
|
|
* for anything and thus we don't want to mark their pointers here
|
|
|
|
*/
|
|
|
|
if (!bkey_cmp(k, &ZERO_KEY))
|
|
|
|
return stale;
|
|
|
|
|
|
|
|
for (i = 0; i < KEY_PTRS(k); i++) {
|
|
|
|
if (!ptr_available(c, k, i))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
g = PTR_BUCKET(c, k, i);
|
|
|
|
|
2014-02-28 09:51:12 +08:00
|
|
|
if (gen_after(g->last_gc, PTR_GEN(k, i)))
|
|
|
|
g->last_gc = PTR_GEN(k, i);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
if (ptr_stale(c, k, i)) {
|
|
|
|
stale = max(stale, ptr_stale(c, k, i));
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
cache_bug_on(GC_MARK(g) &&
|
|
|
|
(GC_MARK(g) == GC_MARK_METADATA) != (level != 0),
|
|
|
|
c, "inconsistent ptrs: mark = %llu, level = %i",
|
|
|
|
GC_MARK(g), level);
|
|
|
|
|
|
|
|
if (level)
|
|
|
|
SET_GC_MARK(g, GC_MARK_METADATA);
|
|
|
|
else if (KEY_DIRTY(k))
|
|
|
|
SET_GC_MARK(g, GC_MARK_DIRTY);
|
2014-03-14 04:46:29 +08:00
|
|
|
else if (!GC_MARK(g))
|
|
|
|
SET_GC_MARK(g, GC_MARK_RECLAIMABLE);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
/* guard against overflow */
|
2018-08-11 13:19:44 +08:00
|
|
|
SET_GC_SECTORS_USED(g, min_t(unsigned int,
|
2013-03-24 07:11:31 +08:00
|
|
|
GC_SECTORS_USED(g) + KEY_SIZE(k),
|
bcache: fix BUG_ON due to integer overflow with GC_SECTORS_USED
The BUG_ON at the end of __bch_btree_mark_key can be triggered due to
an integer overflow error:
BITMASK(GC_SECTORS_USED, struct bucket, gc_mark, 2, 13);
...
SET_GC_SECTORS_USED(g, min_t(unsigned,
GC_SECTORS_USED(g) + KEY_SIZE(k),
(1 << 14) - 1));
BUG_ON(!GC_SECTORS_USED(g));
In bcache.h, the SECTORS_USED bitfield is defined to be 13 bits wide.
While the SET_ code tries to ensure that the field doesn't overflow by
clamping it to (1<<14)-1 == 16383, this is incorrect because 16383
requires 14 bits. Therefore, if GC_SECTORS_USED() + KEY_SIZE() =
8192, the SET_ statement tries to store 8192 into a 13-bit field. In
a 13-bit field, 8192 becomes zero, thus triggering the BUG_ON.
Therefore, create a field width constant and a max value constant, and
use those to create the bitfield and check the inputs to
SET_GC_SECTORS_USED. Arguably the BITMASK() template ought to have
BUG_ON checks for too-large values, but that's a separate patch.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2014-01-29 08:57:39 +08:00
|
|
|
MAX_GC_SECTORS_USED));
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
BUG_ON(!GC_SECTORS_USED(g));
|
|
|
|
}
|
|
|
|
|
|
|
|
return stale;
|
|
|
|
}
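The overflow described in the commit message annotating the SET_GC_SECTORS_USED() clamp can be reproduced in user space. This is a hedged sketch only: store_in_field() is a hypothetical stand-in for what a BITMASK() setter does to a value wider than its field, and the SKETCH_ constants are illustrative names for the 13-bit width quoted in that commit message.

```c
#include <assert.h>

/* Illustrative names; the 13-bit width comes from the commit message. */
#define SKETCH_GC_SECTORS_USED_SIZE 13
#define SKETCH_MAX_GC_SECTORS_USED ((1U << SKETCH_GC_SECTORS_USED_SIZE) - 1)

/* Hypothetical model of a bitfield store: bits above `width` are dropped. */
static unsigned int store_in_field(unsigned int v, unsigned int width)
{
	assert(width > 0 && width < 32);
	return v & ((1U << width) - 1);
}

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Old clamp: (1 << 14) - 1 == 16383 does not fit in 13 bits. */
static unsigned int old_sectors_used(unsigned int used, unsigned int key_size)
{
	return store_in_field(min_uint(used + key_size, (1U << 14) - 1),
			      SKETCH_GC_SECTORS_USED_SIZE);
}

/* Fixed clamp: the limit is derived from the field width itself. */
static unsigned int new_sectors_used(unsigned int used, unsigned int key_size)
{
	return store_in_field(min_uint(used + key_size,
				       SKETCH_MAX_GC_SECTORS_USED),
			      SKETCH_GC_SECTORS_USED_SIZE);
}
```

With used + key_size equal to 8192, the old clamp lets 8192 through and the 13-bit store turns it into 0, which is the value that trips BUG_ON(!GC_SECTORS_USED(g)). The fixed clamp saturates at 8191 instead.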
|
|
|
|
|
|
|
|
#define btree_mark_key(b, k) __bch_btree_mark_key(b->c, b->level, k)
|
|
|
|
|
2014-03-18 06:13:26 +08:00
|
|
|
void bch_initial_mark_key(struct cache_set *c, int level, struct bkey *k)
|
|
|
|
{
|
2018-08-11 13:19:44 +08:00
|
|
|
unsigned int i;
|
2014-03-18 06:13:26 +08:00
|
|
|
|
|
|
|
for (i = 0; i < KEY_PTRS(k); i++)
|
|
|
|
if (ptr_available(c, k, i) &&
|
|
|
|
!ptr_stale(c, k, i)) {
|
|
|
|
struct bucket *b = PTR_BUCKET(c, k, i);
|
|
|
|
|
|
|
|
b->gen = PTR_GEN(k, i);
|
|
|
|
|
|
|
|
if (level && bkey_cmp(k, &ZERO_KEY))
|
|
|
|
b->prio = BTREE_PRIO;
|
|
|
|
else if (!level && b->prio == BTREE_PRIO)
|
|
|
|
b->prio = INITIAL_PRIO;
|
|
|
|
}
|
|
|
|
|
|
|
|
__bch_btree_mark_key(c, level, k);
|
|
|
|
}
|
|
|
|
|
2017-10-31 05:46:33 +08:00
|
|
|
void bch_update_bucket_in_use(struct cache_set *c, struct gc_stat *stats)
|
|
|
|
{
|
|
|
|
stats->in_use = (c->nbuckets - c->avail_nbuckets) * 100 / c->nbuckets;
|
|
|
|
}
|
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
static bool btree_gc_mark_node(struct btree *b, struct gc_stat *gc)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
uint8_t stale = 0;
|
2018-08-11 13:19:44 +08:00
|
|
|
unsigned int keys = 0, good_keys = 0;
|
2013-03-24 07:11:31 +08:00
|
|
|
struct bkey *k;
|
|
|
|
struct btree_iter iter;
|
|
|
|
struct bset_tree *t;
|
|
|
|
|
|
|
|
gc->nodes++;
|
|
|
|
|
2013-11-12 09:35:24 +08:00
|
|
|
for_each_key_filter(&b->keys, k, &iter, bch_ptr_invalid) {
|
2013-03-24 07:11:31 +08:00
|
|
|
stale = max(stale, btree_mark_key(b, k));
|
2013-09-11 10:07:00 +08:00
|
|
|
keys++;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-12-21 09:28:16 +08:00
|
|
|
if (bch_ptr_bad(&b->keys, k))
|
2013-03-24 07:11:31 +08:00
|
|
|
continue;
|
|
|
|
|
|
|
|
gc->key_bytes += bkey_u64s(k);
|
|
|
|
gc->nkeys++;
|
2013-09-11 10:07:00 +08:00
|
|
|
good_keys++;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
gc->data += KEY_SIZE(k);
|
|
|
|
}
|
|
|
|
|
2013-12-21 09:28:16 +08:00
|
|
|
for (t = b->keys.set; t <= &b->keys.set[b->keys.nsets]; t++)
|
2013-03-24 07:11:31 +08:00
|
|
|
btree_bug_on(t->size &&
|
2013-12-21 09:28:16 +08:00
|
|
|
bset_written(&b->keys, t) &&
|
2013-03-24 07:11:31 +08:00
|
|
|
bkey_cmp(&b->key, &t->end) < 0,
|
|
|
|
b, "found short btree key in gc");
|
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
if (b->c->gc_always_rewrite)
|
|
|
|
return true;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
if (stale > 10)
|
|
|
|
return true;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
if ((keys - good_keys) * 2 > keys)
|
|
|
|
return true;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
return false;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
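The rewrite decision at the end of btree_gc_mark_node() depends on three inputs only. The sketch below restates it as a pure function so the thresholds can be checked by hand; sketch_should_rewrite() and its parameters are hypothetical names, and `always` stands in for b->c->gc_always_rewrite.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical restatement of the checks above: rewrite when forced,
 * when the stalest pointer is more than 10 generations old, or when
 * more than half of the keys in the node are bad.
 */
static bool sketch_should_rewrite(bool always, unsigned int stale,
				  unsigned int keys, unsigned int good_keys)
{
	assert(good_keys <= keys);

	if (always)
		return true;
	if (stale > 10)
		return true;
	if ((keys - good_keys) * 2 > keys)
		return true;
	return false;
}
```

Note that both comparisons are strict: a node with exactly half of its keys bad, or with a staleness of exactly 10, is left alone.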
|
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
#define GC_MERGE_NODES 4U
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
struct gc_merge_info {
|
|
|
|
struct btree *b;
|
2018-08-11 13:19:44 +08:00
|
|
|
unsigned int keys;
|
2013-03-24 07:11:31 +08:00
|
|
|
};
|
|
|
|
|
2018-08-11 13:19:46 +08:00
|
|
|
static int bch_btree_insert_node(struct btree *b, struct btree_op *op,
|
|
|
|
struct keylist *insert_keys,
|
|
|
|
atomic_t *journal_ref,
|
|
|
|
struct bkey *replace_key);
|
2013-09-11 10:07:00 +08:00
|
|
|
|
|
|
|
static int btree_gc_coalesce(struct btree *b, struct btree_op *op,
|
2014-03-18 08:15:53 +08:00
|
|
|
struct gc_stat *gc, struct gc_merge_info *r)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2018-08-11 13:19:44 +08:00
|
|
|
unsigned int i, nodes = 0, keys = 0, blocks;
|
2013-09-11 10:07:00 +08:00
|
|
|
struct btree *new_nodes[GC_MERGE_NODES];
|
2014-03-18 08:15:53 +08:00
|
|
|
struct keylist keylist;
|
2013-07-25 09:04:18 +08:00
|
|
|
struct closure cl;
|
2013-09-11 10:07:00 +08:00
|
|
|
struct bkey *k;
|
2013-07-25 09:04:18 +08:00
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
bch_keylist_init(&keylist);
|
|
|
|
|
|
|
|
if (btree_check_reserve(b, NULL))
|
|
|
|
return 0;
|
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
memset(new_nodes, 0, sizeof(new_nodes));
|
2013-07-25 09:04:18 +08:00
|
|
|
closure_init_stack(&cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
while (nodes < GC_MERGE_NODES && !IS_ERR_OR_NULL(r[nodes].b))
|
2013-03-24 07:11:31 +08:00
|
|
|
keys += r[nodes++].keys;
|
|
|
|
|
|
|
|
blocks = btree_default_blocks(b->c) * 2 / 3;
|
|
|
|
|
|
|
|
if (nodes < 2 ||
|
2013-12-21 09:28:16 +08:00
|
|
|
__set_blocks(b->keys.set[0].data, keys,
|
2013-12-18 15:49:49 +08:00
|
|
|
block_bytes(b->c)) > blocks * (nodes - 1))
|
2013-09-11 10:07:00 +08:00
|
|
|
return 0;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
for (i = 0; i < nodes; i++) {
|
2014-03-18 08:15:53 +08:00
|
|
|
new_nodes[i] = btree_node_alloc_replacement(r[i].b, NULL);
|
2013-09-11 10:07:00 +08:00
|
|
|
if (IS_ERR_OR_NULL(new_nodes[i]))
|
|
|
|
goto out_nocoalesce;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
/*
|
|
|
|
* We have to check the reserve here, after we've allocated our new
|
|
|
|
* nodes, to make sure the insert below will succeed - we also check
|
|
|
|
* before as an optimization to potentially avoid a bunch of expensive
|
|
|
|
* allocs/sorts
|
|
|
|
*/
|
|
|
|
if (btree_check_reserve(b, NULL))
|
|
|
|
goto out_nocoalesce;
|
|
|
|
|
2014-03-05 08:42:42 +08:00
|
|
|
for (i = 0; i < nodes; i++)
|
|
|
|
mutex_lock(&new_nodes[i]->write_lock);
|
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
for (i = nodes - 1; i > 0; --i) {
|
2013-12-18 15:49:49 +08:00
|
|
|
struct bset *n1 = btree_bset_first(new_nodes[i]);
|
|
|
|
struct bset *n2 = btree_bset_first(new_nodes[i - 1]);
|
2013-03-24 07:11:31 +08:00
|
|
|
struct bkey *k, *last = NULL;
|
|
|
|
|
|
|
|
keys = 0;
|
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
if (i > 1) {
|
|
|
|
for (k = n2->start;
|
2013-12-18 13:56:21 +08:00
|
|
|
k < bset_bkey_last(n2);
|
2013-09-11 10:07:00 +08:00
|
|
|
k = bkey_next(k)) {
|
|
|
|
if (__set_blocks(n1, n1->keys + keys +
|
2013-12-18 15:49:49 +08:00
|
|
|
bkey_u64s(k),
|
|
|
|
block_bytes(b->c)) > blocks)
|
2013-09-11 10:07:00 +08:00
|
|
|
break;
|
|
|
|
|
|
|
|
last = k;
|
|
|
|
keys += bkey_u64s(k);
|
|
|
|
}
|
|
|
|
} else {
|
2013-03-24 07:11:31 +08:00
|
|
|
/*
|
|
|
|
* Last node we're not getting rid of - we're getting
|
|
|
|
* rid of the node at r[0]. Have to try and fit all of
|
|
|
|
* the remaining keys into this node; we can't ensure
|
|
|
|
* they will always fit due to rounding and variable
|
|
|
|
* length keys (shouldn't be possible in practice,
|
|
|
|
* though)
|
|
|
|
*/
|
2013-09-11 10:07:00 +08:00
|
|
|
if (__set_blocks(n1, n1->keys + n2->keys,
|
2013-12-18 15:49:49 +08:00
|
|
|
block_bytes(b->c)) >
|
|
|
|
btree_blocks(new_nodes[i]))
|
2013-09-11 10:07:00 +08:00
|
|
|
goto out_nocoalesce;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
keys = n2->keys;
|
2013-09-11 10:07:00 +08:00
|
|
|
/* Take the key of the node we're getting rid of */
|
2013-03-24 07:11:31 +08:00
|
|
|
last = &r->b->key;
|
2013-09-11 10:07:00 +08:00
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-12-18 15:49:49 +08:00
|
|
|
BUG_ON(__set_blocks(n1, n1->keys + keys, block_bytes(b->c)) >
|
|
|
|
btree_blocks(new_nodes[i]));
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
if (last)
|
|
|
|
bkey_copy_key(&new_nodes[i]->key, last);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-12-18 13:56:21 +08:00
|
|
|
memcpy(bset_bkey_last(n1),
|
2013-03-24 07:11:31 +08:00
|
|
|
n2->start,
|
2013-12-18 13:56:21 +08:00
|
|
|
(void *) bset_bkey_idx(n2, keys) - (void *) n2->start);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
n1->keys += keys;
|
2013-09-11 10:07:00 +08:00
|
|
|
r[i].keys = n1->keys;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
memmove(n2->start,
|
2013-12-18 13:56:21 +08:00
|
|
|
bset_bkey_idx(n2, keys),
|
|
|
|
(void *) bset_bkey_last(n2) -
|
|
|
|
(void *) bset_bkey_idx(n2, keys));
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
n2->keys -= keys;
|
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
if (__bch_keylist_realloc(&keylist,
|
2013-11-12 10:20:51 +08:00
|
|
|
bkey_u64s(&new_nodes[i]->key)))
|
2013-09-11 10:07:00 +08:00
|
|
|
goto out_nocoalesce;
|
|
|
|
|
|
|
|
bch_btree_node_write(new_nodes[i], &cl);
|
2014-03-18 08:15:53 +08:00
|
|
|
bch_keylist_add(&keylist, &new_nodes[i]->key);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2014-03-05 08:42:42 +08:00
|
|
|
for (i = 0; i < nodes; i++)
|
|
|
|
mutex_unlock(&new_nodes[i]->write_lock);
|
|
|
|
|
2014-03-18 09:22:34 +08:00
|
|
|
closure_sync(&cl);
|
|
|
|
|
|
|
|
/* We emptied out this node */
|
|
|
|
BUG_ON(btree_bset_first(new_nodes[0])->keys);
|
|
|
|
btree_node_free(new_nodes[0]);
|
|
|
|
rw_unlock(true, new_nodes[0]);
|
2014-07-13 12:53:11 +08:00
|
|
|
new_nodes[0] = NULL;
|
2014-03-18 09:22:34 +08:00
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
for (i = 0; i < nodes; i++) {
|
2014-03-18 08:15:53 +08:00
|
|
|
if (__bch_keylist_realloc(&keylist, bkey_u64s(&r[i].b->key)))
|
2013-09-11 10:07:00 +08:00
|
|
|
goto out_nocoalesce;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
make_btree_freeing_key(r[i].b, keylist.top);
|
|
|
|
bch_keylist_push(&keylist);
|
2013-09-11 10:07:00 +08:00
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
bch_btree_insert_node(b, op, &keylist, NULL, NULL);
|
|
|
|
BUG_ON(!bch_keylist_empty(&keylist));
|
2013-09-11 10:07:00 +08:00
|
|
|
|
|
|
|
for (i = 0; i < nodes; i++) {
|
|
|
|
btree_node_free(r[i].b);
|
|
|
|
rw_unlock(true, r[i].b);
|
|
|
|
|
|
|
|
r[i].b = new_nodes[i];
|
|
|
|
}
|
|
|
|
|
|
|
|
memmove(r, r + 1, sizeof(r[0]) * (nodes - 1));
|
|
|
|
r[nodes - 1].b = ERR_PTR(-EINTR);
|
|
|
|
|
|
|
|
trace_bcache_btree_gc_coalesce(nodes);
|
2013-03-24 07:11:31 +08:00
|
|
|
gc->nodes--;
|
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
bch_keylist_free(&keylist);
|
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
/* Invalidated our iterator */
|
|
|
|
return -EINTR;
|
|
|
|
|
|
|
|
out_nocoalesce:
|
|
|
|
closure_sync(&cl);
|
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
while ((k = bch_keylist_pop(&keylist)))
|
2013-09-11 10:07:00 +08:00
|
|
|
if (!bkey_cmp(k, &ZERO_KEY))
|
|
|
|
atomic_dec(&b->c->prio_blocked);
|
2019-04-25 00:48:42 +08:00
|
|
|
bch_keylist_free(&keylist);
|
2013-09-11 10:07:00 +08:00
|
|
|
|
|
|
|
for (i = 0; i < nodes; i++)
|
|
|
|
if (!IS_ERR_OR_NULL(new_nodes[i])) {
|
|
|
|
btree_node_free(new_nodes[i]);
|
|
|
|
rw_unlock(true, new_nodes[i]);
|
|
|
|
}
|
|
|
|
return 0;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
static int btree_gc_rewrite_node(struct btree *b, struct btree_op *op,
|
|
|
|
struct btree *replace)
|
|
|
|
{
|
|
|
|
struct keylist keys;
|
|
|
|
struct btree *n;
|
|
|
|
|
|
|
|
if (btree_check_reserve(b, NULL))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
n = btree_node_alloc_replacement(replace, NULL);
|
|
|
|
|
|
|
|
/* recheck reserve after allocating replacement node */
|
|
|
|
if (btree_check_reserve(b, NULL)) {
|
|
|
|
btree_node_free(n);
|
|
|
|
rw_unlock(true, n);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
bch_btree_node_write_sync(n);
|
|
|
|
|
|
|
|
bch_keylist_init(&keys);
|
|
|
|
bch_keylist_add(&keys, &n->key);
|
|
|
|
|
|
|
|
make_btree_freeing_key(replace, keys.top);
|
|
|
|
bch_keylist_push(&keys);
|
|
|
|
|
|
|
|
bch_btree_insert_node(b, op, &keys, NULL, NULL);
|
|
|
|
BUG_ON(!bch_keylist_empty(&keys));
|
|
|
|
|
|
|
|
btree_node_free(replace);
|
|
|
|
rw_unlock(true, n);
|
|
|
|
|
|
|
|
/* Invalidated our iterator */
|
|
|
|
return -EINTR;
|
|
|
|
}
|
|
|
|
|
2018-08-11 13:19:44 +08:00
|
|
|
static unsigned int btree_gc_count_keys(struct btree *b)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2013-09-11 10:07:00 +08:00
|
|
|
struct bkey *k;
|
|
|
|
struct btree_iter iter;
|
2018-08-11 13:19:44 +08:00
|
|
|
unsigned int ret = 0;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-11-12 09:35:24 +08:00
|
|
|
for_each_key_filter(&b->keys, k, &iter, bch_ptr_bad)
|
2013-09-11 10:07:00 +08:00
|
|
|
ret += bkey_u64s(k);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
|
bcache: calculate the number of incremental GC nodes according to the total of btree nodes
This patch is based on "[PATCH] bcache: finish incremental GC".
Incremental GC pauses for 100 ms whenever front side I/O arrives, so when
there are many btree nodes and GC only processes a constant number (100) of
nodes each time, GC lasts a long time, the front side I/Os run out of
buckets (since no new bucket can be allocated during GC), and I/Os are
blocked again.
So GC should not process a constant number of nodes, but a number that
scales with the number of btree nodes. In this patch, GC is divided into a
constant number (100) of rounds, so when there are many btree nodes, GC
processes more nodes each round; otherwise it processes fewer nodes each
round (but no fewer than MIN_GC_NODES).
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-26 12:17:35 +08:00
|
|
|
static size_t btree_gc_min_nodes(struct cache_set *c)
|
|
|
|
{
|
|
|
|
size_t min_nodes;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Since incremental GC would stop 100ms when front
|
|
|
|
* side I/O comes, so when there are many btree nodes,
|
|
|
|
* if GC only processes constant (100) nodes each time,
|
|
|
|
* GC would last a long time, and the front side I/Os
|
|
|
|
* would run out of the buckets (since no new bucket
|
|
|
|
* can be allocated during GC), and be blocked again.
|
|
|
|
* So GC should not process constant nodes, but varied
|
|
|
|
* nodes according to the number of btree nodes, which
|
|
|
|
* is realized by dividing GC into constant (100) times,
|
|
|
|
* so when there are many btree nodes, GC can process
|
|
|
|
* more nodes each time, otherwise, GC will process less
|
|
|
|
* nodes each time (but no less than MIN_GC_NODES)
|
|
|
|
*/
|
|
|
|
min_nodes = c->gc_stats.nodes / MAX_GC_TIMES;
|
|
|
|
if (min_nodes < MIN_GC_NODES)
|
|
|
|
min_nodes = MIN_GC_NODES;
|
|
|
|
|
|
|
|
return min_nodes;
|
|
|
|
}
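The batch size computed by btree_gc_min_nodes() is a division with a floor. The sketch below mirrors it in user space; the SKETCH_ constants are assumptions (the comment above mentions 100 for the number of rounds, and the value used here for the minimum is a guess, not taken from the header), and sketch_gc_min_nodes() is a hypothetical name.

```c
#include <stddef.h>

/* Assumed values; the real MAX_GC_TIMES / MIN_GC_NODES live in a header. */
#define SKETCH_MAX_GC_TIMES 100
#define SKETCH_MIN_GC_NODES 100

/* Nodes to process per GC round: total / rounds, never below the minimum. */
static size_t sketch_gc_min_nodes(size_t total_nodes)
{
	size_t min_nodes = total_nodes / SKETCH_MAX_GC_TIMES;

	if (min_nodes < SKETCH_MIN_GC_NODES)
		min_nodes = SKETCH_MIN_GC_NODES;

	return min_nodes;
}
```

Under these assumed constants, a tree of 5,000 nodes still gets rounds of 100 nodes, while a tree of 50,000 nodes gets rounds of 500, so the number of pauses stays bounded as the tree grows.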
|
|
|
|
|
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
static int btree_gc_recurse(struct btree *b, struct btree_op *op,
|
|
|
|
struct closure *writes, struct gc_stat *gc)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
bool should_rewrite;
|
|
|
|
struct bkey *k;
|
|
|
|
struct btree_iter iter;
|
2013-03-24 07:11:31 +08:00
|
|
|
struct gc_merge_info r[GC_MERGE_NODES];
|
2014-03-05 08:42:42 +08:00
|
|
|
struct gc_merge_info *i, *last = r + ARRAY_SIZE(r) - 1;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-11-12 09:35:24 +08:00
|
|
|
bch_btree_iter_init(&b->keys, &iter, &b->c->gc_done);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2014-03-05 08:42:42 +08:00
|
|
|
for (i = r; i < r + ARRAY_SIZE(r); i++)
|
|
|
|
i->b = ERR_PTR(-EINTR);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
while (1) {
|
2013-12-21 09:28:16 +08:00
|
|
|
k = bch_btree_iter_next_filter(&iter, &b->keys, bch_ptr_bad);
|
2013-09-11 10:07:00 +08:00
|
|
|
if (k) {
|
2014-03-18 08:15:53 +08:00
|
|
|
r->b = bch_btree_node_get(b->c, op, k, b->level - 1,
|
2014-07-12 15:22:53 +08:00
|
|
|
true, b);
|
2013-09-11 10:07:00 +08:00
|
|
|
if (IS_ERR(r->b)) {
|
|
|
|
ret = PTR_ERR(r->b);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
r->keys = btree_gc_count_keys(r->b);
|
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
ret = btree_gc_coalesce(b, op, gc, r);
|
2013-09-11 10:07:00 +08:00
|
|
|
if (ret)
|
|
|
|
break;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
if (!last->b)
|
|
|
|
break;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
if (!IS_ERR(last->b)) {
|
|
|
|
should_rewrite = btree_gc_mark_node(last->b, gc);
|
2014-03-18 08:15:53 +08:00
|
|
|
if (should_rewrite) {
|
|
|
|
ret = btree_gc_rewrite_node(b, op, last->b);
|
|
|
|
if (ret)
|
2013-09-11 10:07:00 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (last->b->level) {
|
|
|
|
ret = btree_gc_recurse(last->b, op, writes, gc);
|
|
|
|
if (ret)
|
|
|
|
break;
|
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
bkey_copy_key(&b->c->gc_done, &last->b->key);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Must flush leaf nodes before gc ends, since replace
|
|
|
|
* operations aren't journalled
|
|
|
|
*/
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_lock(&last->b->write_lock);
|
2013-09-11 10:07:00 +08:00
|
|
|
if (btree_node_dirty(last->b))
|
|
|
|
bch_btree_node_write(last->b, writes);
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_unlock(&last->b->write_lock);
|
2013-09-11 10:07:00 +08:00
|
|
|
rw_unlock(true, last->b);
|
|
|
|
}
|
|
|
|
|
|
|
|
memmove(r + 1, r, sizeof(r[0]) * (GC_MERGE_NODES - 1));
|
|
|
|
r->b = NULL;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
bcache: finish incremental GC
In the GC thread, we record the latest GC key in gc_done, which is expected
to be used for incremental GC, but the current code does not implement
it. When GC runs, front side I/O is blocked until GC finishes, which
can take a long time if there are a lot of btree nodes.
This patch implements incremental GC. The main idea is that, when there
are front side I/Os, after GC has processed some nodes (100), we stop GC,
release the lock of the btree node, and process the front side I/Os for
some time (100 ms), then go back to GC again.
With this patch, I/Os are not blocked for the whole duration of GC, and
there is no obvious drop of I/O throughput to zero any more.
Patch v2: Rename some variables and macros name as Coly suggested.
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-26 12:17:34 +08:00
|
|
|
if (atomic_read(&b->c->search_inflight) &&
|
2018-07-26 12:17:35 +08:00
|
|
|
gc->nodes >= gc->nodes_pre + btree_gc_min_nodes(b->c)) {
|
2018-07-26 12:17:34 +08:00
|
|
|
gc->nodes_pre = gc->nodes;
|
|
|
|
ret = -EAGAIN;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
if (need_resched()) {
|
|
|
|
ret = -EAGAIN;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-03-05 08:42:42 +08:00
|
|
|
for (i = r; i < r + ARRAY_SIZE(r); i++)
|
|
|
|
if (!IS_ERR_OR_NULL(i->b)) {
|
|
|
|
mutex_lock(&i->b->write_lock);
|
|
|
|
if (btree_node_dirty(i->b))
|
|
|
|
bch_btree_node_write(i->b, writes);
|
|
|
|
mutex_unlock(&i->b->write_lock);
|
|
|
|
rw_unlock(true, i->b);
|
2013-09-11 10:07:00 +08:00
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int bch_btree_gc_root(struct btree *b, struct btree_op *op,
|
|
|
|
struct closure *writes, struct gc_stat *gc)
|
|
|
|
{
|
|
|
|
struct btree *n = NULL;
|
2013-09-11 10:07:00 +08:00
|
|
|
int ret = 0;
|
|
|
|
bool should_rewrite;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
should_rewrite = btree_gc_mark_node(b, gc);
|
|
|
|
if (should_rewrite) {
|
2014-03-18 08:15:53 +08:00
|
|
|
n = btree_node_alloc_replacement(b, NULL);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
if (!IS_ERR_OR_NULL(n)) {
|
|
|
|
bch_btree_node_write_sync(n);
|
2014-03-05 08:42:42 +08:00
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
bch_btree_set_root(n);
|
|
|
|
btree_node_free(b);
|
|
|
|
rw_unlock(true, n);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
return -EINTR;
|
|
|
|
}
|
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2014-03-18 06:13:26 +08:00
|
|
|
__bch_btree_mark_key(b->c, b->level + 1, &b->key);
|
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
if (b->level) {
|
|
|
|
ret = btree_gc_recurse(b, op, writes, gc);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2013-09-11 10:07:00 +08:00
|
|
|
bkey_copy_key(&b->c->gc_done, &b->key);
|
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void btree_gc_start(struct cache_set *c)
|
|
|
|
{
|
|
|
|
struct cache *ca;
|
|
|
|
struct bucket *b;
|
2018-08-11 13:19:44 +08:00
|
|
|
unsigned int i;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
if (!c->gc_mark_valid)
|
|
|
|
return;
|
|
|
|
|
|
|
|
mutex_lock(&c->bucket_lock);
|
|
|
|
|
|
|
|
c->gc_mark_valid = 0;
|
|
|
|
c->gc_done = ZERO_KEY;
|
|
|
|
|
|
|
|
for_each_cache(ca, c, i)
|
|
|
|
for_each_bucket(b, ca) {
|
2014-02-28 09:51:12 +08:00
|
|
|
b->last_gc = b->gen;
|
2013-07-12 10:43:21 +08:00
|
|
|
if (!atomic_read(&b->pin)) {
|
2014-03-14 04:46:29 +08:00
|
|
|
SET_GC_MARK(b, 0);
|
2013-07-12 10:43:21 +08:00
|
|
|
SET_GC_SECTORS_USED(b, 0);
|
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
mutex_unlock(&c->bucket_lock);
|
|
|
|
}
|
|
|
|
|
2017-10-31 05:46:33 +08:00
|
|
|
static void bch_btree_gc_finish(struct cache_set *c)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
struct bucket *b;
|
|
|
|
struct cache *ca;
|
2018-08-11 13:19:44 +08:00
|
|
|
unsigned int i;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
mutex_lock(&c->bucket_lock);
|
|
|
|
|
|
|
|
set_gc_sectors(c);
|
|
|
|
c->gc_mark_valid = 1;
|
|
|
|
c->need_gc = 0;
|
|
|
|
|
|
|
|
for (i = 0; i < KEY_PTRS(&c->uuid_bucket); i++)
|
|
|
|
SET_GC_MARK(PTR_BUCKET(c, &c->uuid_bucket, i),
|
|
|
|
GC_MARK_METADATA);
|
|
|
|
|
2013-11-27 11:14:23 +08:00
|
|
|
/* don't reclaim buckets to which writeback keys point */
|
|
|
|
rcu_read_lock();
|
2018-01-09 04:21:28 +08:00
|
|
|
for (i = 0; i < c->devices_max_used; i++) {
|
2013-11-27 11:14:23 +08:00
|
|
|
struct bcache_device *d = c->devices[i];
|
|
|
|
struct cached_dev *dc;
|
|
|
|
struct keybuf_key *w, *n;
|
2018-08-11 13:19:44 +08:00
|
|
|
unsigned int j;
|
2013-11-27 11:14:23 +08:00
|
|
|
|
|
|
|
if (!d || UUID_FLASH_ONLY(&c->uuids[i]))
|
|
|
|
continue;
|
|
|
|
dc = container_of(d, struct cached_dev, disk);
|
|
|
|
|
|
|
|
spin_lock(&dc->writeback_keys.lock);
|
|
|
|
rbtree_postorder_for_each_entry_safe(w, n,
|
|
|
|
&dc->writeback_keys.keys, node)
|
|
|
|
for (j = 0; j < KEY_PTRS(&w->key); j++)
|
|
|
|
SET_GC_MARK(PTR_BUCKET(c, &w->key, j),
|
|
|
|
GC_MARK_DIRTY);
|
|
|
|
spin_unlock(&dc->writeback_keys.lock);
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
2017-10-31 05:46:33 +08:00
|
|
|
c->avail_nbuckets = 0;
|
2013-03-24 07:11:31 +08:00
|
|
|
for_each_cache(ca, c, i) {
|
|
|
|
uint64_t *i;
|
|
|
|
|
|
|
|
ca->invalidate_needs_gc = 0;
|
|
|
|
|
|
|
|
for (i = ca->sb.d; i < ca->sb.d + ca->sb.keys; i++)
|
|
|
|
SET_GC_MARK(ca->buckets + *i, GC_MARK_METADATA);
|
|
|
|
|
|
|
|
for (i = ca->prio_buckets;
|
|
|
|
i < ca->prio_buckets + prio_buckets(ca) * 2; i++)
|
|
|
|
SET_GC_MARK(ca->buckets + *i, GC_MARK_METADATA);
|
|
|
|
|
|
|
|
for_each_bucket(b, ca) {
|
|
|
|
c->need_gc = max(c->need_gc, bucket_gc_gen(b));
|
|
|
|
|
2014-03-14 04:46:29 +08:00
|
|
|
if (atomic_read(&b->pin))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
BUG_ON(!GC_MARK(b) && GC_SECTORS_USED(b));
|
|
|
|
|
|
|
|
if (!GC_MARK(b) || GC_MARK(b) == GC_MARK_RECLAIMABLE)
|
2017-10-31 05:46:33 +08:00
|
|
|
c->avail_nbuckets++;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_unlock(&c->bucket_lock);
|
|
|
|
}
|
|
|
|
|
2013-10-25 08:19:26 +08:00
|
|
|
static void bch_btree_gc(struct cache_set *c)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct gc_stat stats;
|
|
|
|
struct closure writes;
|
|
|
|
struct btree_op op;
|
|
|
|
uint64_t start_time = local_clock();
|
2013-04-26 04:58:35 +08:00
|
|
|
|
2013-04-27 06:39:55 +08:00
|
|
|
trace_bcache_gc_start(c);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
memset(&stats, 0, sizeof(struct gc_stat));
|
|
|
|
closure_init_stack(&writes);
|
2013-07-25 09:04:18 +08:00
|
|
|
bch_btree_op_init(&op, SHRT_MAX);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
btree_gc_start(c);
|
|
|
|
|
bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags
When too many I/Os failed on cache device, bch_cache_set_error() is called
in the error handling code path to retire whole problematic cache set. If
new I/O requests continue to come and take refcount dc->count, the cache
set won't be retired immediately, this is a problem.
Furthermore, several kernel threads and self-armed kernel work items may
still be running after bch_cache_set_error() is called. It needs to wait
quite a while for them to stop, or they won't stop at all. They also
prevent the cache set from being retired.
The solution in this patch is, to add per cache set flag to disable I/O
request on this cache and all attached backing devices. Then new coming I/O
requests can be rejected in *_make_request() before taking refcount, kernel
threads and self-armed kernel worker can stop very fast when flags bit
CACHE_SET_IO_DISABLE is set.
Because bcache also do internal I/Os for writeback, garbage collection,
bucket allocation, journaling, this kind of I/O should be disabled after
bch_cache_set_error() is called. So closure_bio_submit() is modified to
check whether CACHE_SET_IO_DISABLE is set on cache_set->flags. If set,
closure_bio_submit() will set bio->bi_status to BLK_STS_IOERR and
return, generic_make_request() won't be called.
A sysfs interface is also added to set or clear CACHE_SET_IO_DISABLE bit
from cache_set->flags, to disable or enable cache set I/O for debugging. It
is helpful to trigger more corner case issues for failed cache device.
Changelog
v4, add wait_for_kthread_stop(), and call it before exits writeback and gc
kernel threads.
v3, change CACHE_SET_IO_DISABLE from 4 to 3, since it is bit index.
remove "bcache: " prefix when printing out kernel message.
v2, more changes by previous review,
- Use CACHE_SET_IO_DISABLE of cache_set->flags, suggested by Junhui.
- Check CACHE_SET_IO_DISABLE in bch_btree_gc() to stop a while-loop, this
is reported and inspired from origal patch of Pavel Vazharov.
v1, initial version.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Cc: Michael Lyle <mlyle@lyle.org>
Cc: Pavel Vazharov <freakpv@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-19 08:36:17 +08:00
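The flag check described in the commit message above can be modelled in userspace roughly as follows. This is a minimal sketch, not the kernel code: `struct cache_set_model`, `io_disabled()` and `submit_request()` are hypothetical names, the plain shift-and-mask stands in for the kernel's `test_bit()`, and `-1` stands in for an I/O error status. Only the bit index 3 is taken from the changelog.

```c
#include <stdbool.h>

/* Bit index of the flag; 3 per the v3 changelog note above. */
enum { CACHE_SET_IO_DISABLE = 3 };

/* Hypothetical stand-in for the parts of struct cache_set used here. */
struct cache_set_model {
	unsigned long flags;   /* bit field, one bit per flag */
	int refcount;          /* models the refcount taken per request */
};

/* Userspace replacement for test_bit(CACHE_SET_IO_DISABLE, &c->flags). */
static bool io_disabled(const struct cache_set_model *c)
{
	return (c->flags >> CACHE_SET_IO_DISABLE) & 1UL;
}

/*
 * Model of the request entry point: reject the request before the
 * refcount is taken when I/O is disabled. Returns 0 on acceptance,
 * -1 (an assumed error value) on rejection.
 */
static int submit_request(struct cache_set_model *c)
{
	if (io_disabled(c))
		return -1;
	c->refcount++;
	return 0;
}
```

Because the check happens before the refcount is incremented, a rejected request never holds a reference that would delay retiring the cache set.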
|
|
|
/* if CACHE_SET_IO_DISABLE set, gc thread should stop too */
|
2013-09-11 10:07:00 +08:00
|
|
|
do {
|
|
|
|
ret = btree_root(gc_root, c, &op, &writes, &stats);
|
|
|
|
closure_sync(&writes);
|
2015-11-30 09:18:33 +08:00
|
|
|
cond_resched();
|
2013-03-24 07:11:31 +08:00
|
|
|
|
bcache: finish incremental GC
In the GC thread, we record the latest GC key in gc_done, which is expected
to be used for incremental GC, but the current code does not implement
it. When GC runs, front-side I/O is blocked until GC finishes, which
can take a long time if there are many btree nodes.
This patch implements incremental GC. The main idea is that, when there
is front-side I/O, after GC has processed some nodes (100), we stop GC, release
the lock of the btree node, and go to process the front-side I/O for some time
(100 ms), then go back to GC again.
With this patch, I/O is not blocked for the whole duration of GC, and
the obvious drops of I/O to zero no longer occur.
Patch v2: Rename some variables and macros as Coly suggested.
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-26 12:17:34 +08:00
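The stop-and-resume behaviour described in the commit message above might be sketched like this in userspace. It is only a model under stated assumptions: `struct gc_model`, `gc_pass()`, `gc_run()` and `GC_NODES_PER_PASS` are hypothetical names, nodes are reduced to a counter, and the 100 ms sleep is represented by a comment rather than a real delay.

```c
#include <errno.h>
#include <stddef.h>

/* "some nodes (100)" from the commit message; the macro name is hypothetical. */
#define GC_NODES_PER_PASS 100

/* Hypothetical model: GC progress is just a count of visited nodes. */
struct gc_model {
	size_t total_nodes;
	size_t done;        /* resume point, playing the role of gc_done */
};

/*
 * One GC pass. Returns 0 when every node has been visited, or -EAGAIN
 * when the pass yields so that front-side I/O can make progress.
 */
static int gc_pass(struct gc_model *gc, int front_io_pending)
{
	size_t processed = 0;

	while (gc->done < gc->total_nodes) {
		gc->done++;
		processed++;
		if (front_io_pending && processed >= GC_NODES_PER_PASS &&
		    gc->done < gc->total_nodes)
			return -EAGAIN;
	}
	return 0;
}

/* Drive passes until GC completes; returns how often GC yielded. */
static unsigned int gc_run(struct gc_model *gc, int front_io_pending)
{
	unsigned int yields = 0;

	while (gc_pass(gc, front_io_pending) == -EAGAIN)
		yields++;   /* the kernel code sleeps for GC_SLEEP_MS here */
	return yields;
}
```

The resume point survives across passes, so each pass continues where the previous one stopped instead of restarting from the first node.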
|
|
|
if (ret == -EAGAIN)
|
|
|
|
schedule_timeout_interruptible(msecs_to_jiffies
|
|
|
|
(GC_SLEEP_MS));
|
|
|
|
else if (ret)
|
2013-09-11 10:07:00 +08:00
|
|
|
pr_warn("gc failed!");
|
2018-03-19 08:36:17 +08:00
|
|
|
} while (ret && !test_bit(CACHE_SET_IO_DISABLE, &c->flags));
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2017-10-31 05:46:33 +08:00
|
|
|
bch_btree_gc_finish(c);
|
2013-04-26 04:58:35 +08:00
|
|
|
wake_up_allocators(c);
|
|
|
|
|
2013-03-29 02:50:55 +08:00
|
|
|
bch_time_stats_update(&c->btree_gc_time, start_time);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
stats.key_bytes *= sizeof(uint64_t);
|
|
|
|
stats.data <<= 9;
|
2017-10-31 05:46:33 +08:00
|
|
|
bch_update_bucket_in_use(c, &stats);
|
2013-03-24 07:11:31 +08:00
|
|
|
memcpy(&c->gc_stats, &stats, sizeof(struct gc_stat));
|
|
|
|
|
2013-04-27 06:39:55 +08:00
|
|
|
trace_bcache_gc_end(c);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-10-25 08:19:26 +08:00
|
|
|
bch_moving_gc(c);
|
|
|
|
}
|
|
|
|
|
2016-10-27 11:31:17 +08:00
|
|
|
static bool gc_should_run(struct cache_set *c)
|
2013-10-25 08:19:26 +08:00
|
|
|
{
|
2013-09-11 10:07:00 +08:00
|
|
|
struct cache *ca;
|
2018-08-11 13:19:44 +08:00
|
|
|
unsigned int i;
|
2013-10-25 08:19:26 +08:00
|
|
|
|
2016-10-27 11:31:17 +08:00
|
|
|
for_each_cache(ca, c, i)
|
|
|
|
if (ca->invalidate_needs_gc)
|
|
|
|
return true;
|
2013-10-25 08:19:26 +08:00
|
|
|
|
2016-10-27 11:31:17 +08:00
|
|
|
if (atomic_read(&c->sectors_to_gc) < 0)
|
|
|
|
return true;
|
2013-10-25 08:19:26 +08:00
|
|
|
|
2016-10-27 11:31:17 +08:00
|
|
|
return false;
|
|
|
|
}
|
2013-09-11 10:07:00 +08:00
|
|
|
|
2016-10-27 11:31:17 +08:00
|
|
|
static int bch_gc_thread(void *arg)
|
|
|
|
{
|
|
|
|
struct cache_set *c = arg;
|
2013-09-11 10:07:00 +08:00
|
|
|
|
2016-10-27 11:31:17 +08:00
|
|
|
while (1) {
|
|
|
|
wait_event_interruptible(c->gc_wait,
|
2018-03-19 08:36:17 +08:00
|
|
|
kthread_should_stop() ||
|
|
|
|
test_bit(CACHE_SET_IO_DISABLE, &c->flags) ||
|
|
|
|
gc_should_run(c));
|
2013-09-11 10:07:00 +08:00
|
|
|
|
2018-03-19 08:36:17 +08:00
|
|
|
if (kthread_should_stop() ||
|
|
|
|
test_bit(CACHE_SET_IO_DISABLE, &c->flags))
|
2016-10-27 11:31:17 +08:00
|
|
|
break;
|
|
|
|
|
|
|
|
set_gc_sectors(c);
|
|
|
|
bch_btree_gc(c);
|
2013-10-25 08:19:26 +08:00
|
|
|
}
|
|
|
|
|
2018-03-19 08:36:17 +08:00
|
|
|
wait_for_kthread_stop();
|
2013-10-25 08:19:26 +08:00
|
|
|
return 0;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2013-10-25 08:19:26 +08:00
|
|
|
int bch_gc_thread_start(struct cache_set *c)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2016-10-27 11:31:17 +08:00
|
|
|
c->gc_thread = kthread_run(bch_gc_thread, c, "bcache_gc");
|
2018-01-09 04:21:20 +08:00
|
|
|
return PTR_ERR_OR_ZERO(c->gc_thread);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Initial partial gc */
|
|
|
|
|
2014-03-18 06:13:26 +08:00
|
|
|
static int bch_btree_check_recurse(struct btree *b, struct btree_op *op)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2013-09-11 08:18:59 +08:00
|
|
|
int ret = 0;
|
|
|
|
struct bkey *k, *p = NULL;
|
2013-03-24 07:11:31 +08:00
|
|
|
struct btree_iter iter;
|
|
|
|
|
2014-03-18 06:13:26 +08:00
|
|
|
for_each_key_filter(&b->keys, k, &iter, bch_ptr_invalid)
|
|
|
|
bch_initial_mark_key(b->c, b->level, k);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2014-03-18 06:13:26 +08:00
|
|
|
bch_initial_mark_key(b->c, b->level + 1, &b->key);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
if (b->level) {
|
2013-11-12 09:35:24 +08:00
|
|
|
bch_btree_iter_init(&b->keys, &iter, NULL);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 08:18:59 +08:00
|
|
|
do {
|
2013-12-21 09:28:16 +08:00
|
|
|
k = bch_btree_iter_next_filter(&iter, &b->keys,
|
|
|
|
bch_ptr_bad);
|
bcache: calculate the number of incremental GC nodes according to the total of btree nodes
This patch is based on "[PATCH] bcache: finish incremental GC".
Since incremental GC stops for 100 ms when front-side I/O arrives, when
there are many btree nodes, if GC only processes a constant number (100) of
nodes each time, GC would last a long time, the front-side I/O would run out of
buckets (since no new bucket can be allocated during GC), and I/O would be
blocked again.
So GC should not process a constant number of nodes, but a number that varies
with the number of btree nodes. In this patch, GC is divided into a constant
number (100) of passes, so when there are many btree nodes, GC processes more
nodes each time; otherwise GC processes fewer nodes each time (but no fewer than
MIN_GC_NODES).
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-26 12:17:35 +08:00
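The sizing rule in the commit message above amounts to a clamped division. A minimal sketch follows, assuming the name `MAX_GC_TIMES` for the "constant (100) times" and a value of 100 for `MIN_GC_NODES`; the message names the latter macro but does not state its value, and `gc_min_nodes()` is a hypothetical helper.

```c
#include <stddef.h>

#define MAX_GC_TIMES 100   /* assumed name for the constant number of passes */
#define MIN_GC_NODES 100   /* assumed value; the message only names the macro */

/*
 * Number of btree nodes to process per GC pass: the total divided into
 * MAX_GC_TIMES passes, but never fewer than MIN_GC_NODES.
 */
static size_t gc_min_nodes(size_t total_nodes)
{
	size_t per_pass = total_nodes / MAX_GC_TIMES;

	return per_pass < MIN_GC_NODES ? MIN_GC_NODES : per_pass;
}
```

With these assumed constants, a tree of 50000 nodes is processed 500 nodes at a time, while any tree with fewer than 10000 nodes falls back to the floor of 100.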
|
|
|
if (k) {
|
2014-07-12 15:22:53 +08:00
|
|
|
btree_node_prefetch(b, k);
|
2018-07-26 12:17:35 +08:00
|
|
|
/*
|
|
|
|
* initialize c->gc_stats.nodes
|
|
|
|
* for incremental GC
|
|
|
|
*/
|
|
|
|
b->c->gc_stats.nodes++;
|
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 08:18:59 +08:00
|
|
|
if (p)
|
2014-03-18 06:13:26 +08:00
|
|
|
ret = btree(check_recurse, p, b, op);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 08:18:59 +08:00
|
|
|
p = k;
|
|
|
|
} while (p && !ret);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2014-03-18 06:13:26 +08:00
|
|
|
return ret;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2013-07-25 08:44:17 +08:00
|
|
|
int bch_btree_check(struct cache_set *c)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2013-07-25 08:44:17 +08:00
|
|
|
struct btree_op op;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 09:04:18 +08:00
|
|
|
bch_btree_op_init(&op, SHRT_MAX);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2014-03-18 06:13:26 +08:00
|
|
|
return btree_root(check_recurse, c, &op);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2014-03-18 07:55:55 +08:00
|
|
|
void bch_initial_gc_finish(struct cache_set *c)
|
|
|
|
{
|
|
|
|
struct cache *ca;
|
|
|
|
struct bucket *b;
|
2018-08-11 13:19:44 +08:00
|
|
|
unsigned int i;
|
2014-03-18 07:55:55 +08:00
|
|
|
|
|
|
|
bch_btree_gc_finish(c);
|
|
|
|
|
|
|
|
mutex_lock(&c->bucket_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We need to put some unused buckets directly on the prio freelist in
|
|
|
|
* order to get the allocator thread started - it needs freed buckets in
|
|
|
|
* order to rewrite the prios and gens, and it needs to rewrite prios
|
|
|
|
* and gens in order to free buckets.
|
|
|
|
*
|
|
|
|
* This is only safe for buckets that have no live data in them, which
|
|
|
|
* there should always be some of.
|
|
|
|
*/
|
|
|
|
for_each_cache(ca, c, i) {
|
|
|
|
for_each_bucket(b, ca) {
|
bcache: fix for allocator and register thread race
After a long run of random small-I/O writes,
I rebooted the machine, and after it powered on,
I found bcache stuck. The stacks are:
[root@ceph153 ~]# cat /proc/2510/task/*/stack
[<ffffffffa06b2455>] closure_sync+0x25/0x90 [bcache]
[<ffffffffa06b6be8>] bch_journal+0x118/0x2b0 [bcache]
[<ffffffffa06b6dc7>] bch_journal_meta+0x47/0x70 [bcache]
[<ffffffffa06be8f7>] bch_prio_write+0x237/0x340 [bcache]
[<ffffffffa06a8018>] bch_allocator_thread+0x3c8/0x3d0 [bcache]
[<ffffffff810a631f>] kthread+0xcf/0xe0
[<ffffffff8164c318>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff
[root@ceph153 ~]# cat /proc/2038/task/*/stack
[<ffffffffa06b1abd>] __bch_btree_map_nodes+0x12d/0x150 [bcache]
[<ffffffffa06b1bd1>] bch_btree_insert+0xf1/0x170 [bcache]
[<ffffffffa06b637f>] bch_journal_replay+0x13f/0x230 [bcache]
[<ffffffffa06c75fe>] run_cache_set+0x79a/0x7c2 [bcache]
[<ffffffffa06c0cf8>] register_bcache+0xd48/0x1310 [bcache]
[<ffffffff812f702f>] kobj_attr_store+0xf/0x20
[<ffffffff8125b216>] sysfs_write_file+0xc6/0x140
[<ffffffff811dfbfd>] vfs_write+0xbd/0x1e0
[<ffffffff811e069f>] SyS_write+0x7f/0xe0
[<ffffffff8164c3c9>] system_call_fastpath+0x16/0x1
The stacks show that the register thread and the allocator thread
were stuck while registering the cache device.
I rebooted the machine several times, and the issue always
exists on this machine.
I debugged the code and found the call trace below:
register_bcache()
==>run_cache_set()
==>bch_journal_replay()
==>bch_btree_insert()
==>__bch_btree_map_nodes()
==>btree_insert_fn()
==>btree_split() //node need split
==>btree_check_reserve()
In btree_check_reserve(), it checks whether there are enough buckets
of RESERVE_BTREE type. Since the allocator thread has not run yet,
no buckets of RESERVE_BTREE type are allocated, so the register thread
waits on c->btree_cache_wait and goes to sleep.
Then the allocator thread is initialized; its call trace is below:
bch_allocator_thread()
==>bch_prio_write()
==>bch_journal_meta()
==>bch_journal()
==>journal_wait_for_write()
In journal_wait_for_write(), it checks whether the journal is full via
journal_full(), but the long run of random small-I/O writes
has exhausted the journal buckets (journal.blocks_free=0).
In order to release journal buckets,
the allocator calls btree_flush_write() to flush keys to
btree nodes, and waits on c->journal.wait until the btree node writes
finish or some journal bucket space is already available; then the
allocator thread goes to sleep. But in btree_flush_write(), since
bch_journal_replay() has not finished, no btree node has a journal
(the condition "if (btree_current_write(b)->journal)" is never satisfied),
so there is no btree node to flush, no journal bucket is released,
and the allocator sleeps forever.
From the above analysis, we can see that:
1) The register thread waits for the allocator thread to allocate buckets of
RESERVE_BTREE type;
2) The allocator thread waits for the register thread to replay the journal, so it
can flush btree nodes and get a journal bucket.
So they are both stuck waiting for each other.
Hua Rui provided a patch for me that allocates some buckets of
RESERVE_BTREE type in advance, so the register thread can get a bucket
when a btree node splits and does not need to wait for the allocator
thread. I tested it; it helps, and the register thread gets a step
further, but eventually both still get stuck. The reason is that only 8 buckets
of RESERVE_BTREE type were allocated, and in bch_journal_replay(),
after 2 btree node splits, only 4 buckets of RESERVE_BTREE type are left;
then btree_check_reserve() is no longer satisfied, so it goes to sleep
again, and at the same time the allocator thread has not flushed enough btree
nodes to release a journal bucket, so they both get stuck again.
So we need to allocate more buckets of RESERVE_BTREE type in advance,
but how many are enough? From experience and testing, I think it should be
as many as the journal buckets. I then modified the code as in this patch,
tested it on the machine, and it works.
This patch is based on Hua Rui’s patch, and allocates more buckets
of RESERVE_BTREE type in advance to keep the register thread and the allocator
thread from waiting for each other.
[patch v2] ca->sb.njournal_buckets is 0 the first time after
cache creation, when no journal exists, so just 8 btree buckets is OK.
Signed-off-by: Hua Rui <huarui.dev@gmail.com>
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-08 03:41:43 +08:00
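The pre-allocation idea from the commit message above can be illustrated with a small userspace model: fill the prio reserve first, let the overflow go to the btree reserve, and stop once both are full. Everything here is a hypothetical stand-in; `struct fifo_model` and the function versions of `fifo_full()` and `fifo_push()` only imitate the bcache helpers of the same name, and the check that a bucket holds no live data is left out.

```c
#include <stdbool.h>
#include <stddef.h>

#define FIFO_CAP 8   /* arbitrary capacity limit for this model */

/* Hypothetical fixed-size FIFO holding bucket indices. */
struct fifo_model {
	size_t used;
	size_t size;           /* usable slots, at most FIFO_CAP */
	long data[FIFO_CAP];
};

static bool fifo_full(const struct fifo_model *f)
{
	return f->used == f->size;
}

/* Returns false without storing anything when the FIFO is full. */
static bool fifo_push(struct fifo_model *f, long v)
{
	if (fifo_full(f))
		return false;
	f->data[f->used++] = v;
	return true;
}

/*
 * Hand unused buckets to the reserves before the allocator thread runs.
 * Returns the number of buckets placed on either reserve.
 */
static size_t prefill_reserves(struct fifo_model *prio,
			       struct fifo_model *btree, size_t nbuckets)
{
	size_t taken = 0;

	for (size_t b = 0; b < nbuckets; b++) {
		if (fifo_full(prio) && fifo_full(btree))
			break;
		/* Prefer the prio reserve; overflow goes to the btree reserve. */
		if (!fifo_push(prio, (long)b))
			fifo_push(btree, (long)b);
		taken++;
	}
	return taken;
}
```

Filling the btree reserve up front is what lets the register thread split nodes during journal replay without waiting on an allocator thread that is itself waiting on the replay.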
|
|
|
if (fifo_full(&ca->free[RESERVE_PRIO]) &&
|
|
|
|
fifo_full(&ca->free[RESERVE_BTREE]))
|
2014-03-18 07:55:55 +08:00
|
|
|
break;
|
|
|
|
|
|
|
|
if (bch_can_invalidate_bucket(ca, b) &&
|
|
|
|
!GC_MARK(b)) {
|
|
|
|
__bch_invalidate_one_bucket(ca, b);
|
2018-02-08 03:41:43 +08:00
|
|
|
if (!fifo_push(&ca->free[RESERVE_PRIO],
|
|
|
|
b - ca->buckets))
|
|
|
|
fifo_push(&ca->free[RESERVE_BTREE],
|
|
|
|
b - ca->buckets);
|
2014-03-18 07:55:55 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_unlock(&c->bucket_lock);
|
|
|
|
}
|
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
/* Btree insertion */
|
|
|
|
|
2013-11-12 09:02:31 +08:00
|
|
|
static bool btree_insert_key(struct btree *b, struct bkey *k,
|
|
|
|
struct bkey *replace_key)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2018-08-11 13:19:44 +08:00
|
|
|
unsigned int status;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
BUG_ON(bkey_cmp(k, &b->key) > 0);
|
2013-11-11 13:55:27 +08:00
|
|
|
|
2013-11-12 09:02:31 +08:00
|
|
|
status = bch_btree_insert_key(&b->keys, k, replace_key);
|
|
|
|
if (status != BTREE_INSERT_STATUS_NO_INSERT) {
|
|
|
|
bch_check_keys(&b->keys, "%u for %s", status,
|
|
|
|
replace_key ? "replace" : "insert");
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-11-12 09:02:31 +08:00
|
|
|
trace_bcache_btree_insert_key(b, k, replace_key != NULL,
|
|
|
|
status);
|
|
|
|
return true;
|
|
|
|
} else
|
|
|
|
return false;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2013-11-12 11:03:54 +08:00
|
|
|
static size_t insert_u64s_remaining(struct btree *b)
|
|
|
|
{
|
2014-01-11 10:53:02 +08:00
|
|
|
long ret = bch_btree_keys_u64s_remaining(&b->keys);
|
2013-11-12 11:03:54 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Might land in the middle of an existing extent and have to split it
|
|
|
|
*/
|
|
|
|
if (b->keys.ops->is_extents)
|
|
|
|
ret -= KEY_MAX_U64S;
|
|
|
|
|
|
|
|
return max(ret, 0L);
|
|
|
|
}
|
|
|
|
|
2013-09-11 09:41:15 +08:00
|
|
|
static bool bch_btree_insert_keys(struct btree *b, struct btree_op *op,
|
2013-09-11 09:52:54 +08:00
|
|
|
struct keylist *insert_keys,
|
|
|
|
struct bkey *replace_key)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
bool ret = false;
|
2013-12-18 15:47:33 +08:00
|
|
|
int oldsize = bch_count_data(&b->keys);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 09:41:15 +08:00
|
|
|
while (!bch_keylist_empty(insert_keys)) {
|
2013-07-25 08:24:25 +08:00
|
|
|
struct bkey *k = insert_keys->keys;
|
2013-09-11 09:41:15 +08:00
|
|
|
|
2013-11-12 11:03:54 +08:00
|
|
|
if (bkey_u64s(k) > insert_u64s_remaining(b))
|
2013-07-25 08:22:44 +08:00
|
|
|
break;
|
|
|
|
|
|
|
|
if (bkey_cmp(k, &b->key) <= 0) {
|
2013-07-25 07:46:42 +08:00
|
|
|
if (!b->level)
|
|
|
|
bkey_put(b->c, k);
|
2013-09-11 09:41:15 +08:00
|
|
|
|
2013-11-12 09:02:31 +08:00
|
|
|
ret |= btree_insert_key(b, k, replace_key);
|
2013-09-11 09:41:15 +08:00
|
|
|
bch_keylist_pop_front(insert_keys);
|
|
|
|
} else if (bkey_cmp(&START_KEY(k), &b->key) < 0) {
|
|
|
|
BKEY_PADDED(key) temp;
|
2013-07-25 08:24:25 +08:00
|
|
|
bkey_copy(&temp.key, insert_keys->keys);
|
2013-09-11 09:41:15 +08:00
|
|
|
|
|
|
|
bch_cut_back(&b->key, &temp.key);
|
2013-07-25 08:24:25 +08:00
|
|
|
bch_cut_front(&b->key, insert_keys->keys);
|
2013-09-11 09:41:15 +08:00
|
|
|
|
2013-11-12 09:02:31 +08:00
|
|
|
ret |= btree_insert_key(b, &temp.key, replace_key);
|
2013-09-11 09:41:15 +08:00
|
|
|
break;
|
|
|
|
} else {
|
|
|
|
break;
|
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2013-11-12 09:02:31 +08:00
|
|
|
if (!ret)
|
|
|
|
op->insert_collision = true;
|
|
|
|
|
2013-07-25 08:22:44 +08:00
|
|
|
BUG_ON(!bch_keylist_empty(insert_keys) && b->level);
|
|
|
|
|
2013-12-18 15:47:33 +08:00
|
|
|
BUG_ON(bch_count_data(&b->keys) < oldsize);
|
2013-03-24 07:11:31 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-09-11 09:41:15 +08:00
|
|
|
static int btree_split(struct btree *b, struct btree_op *op,
|
|
|
|
struct keylist *insert_keys,
|
2013-09-11 09:52:54 +08:00
|
|
|
struct bkey *replace_key)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2013-07-25 08:20:19 +08:00
|
|
|
bool split;
|
2013-03-24 07:11:31 +08:00
|
|
|
struct btree *n1, *n2 = NULL, *n3 = NULL;
|
|
|
|
uint64_t start_time = local_clock();
|
2013-07-25 09:04:18 +08:00
|
|
|
struct closure cl;
|
2013-07-27 03:32:38 +08:00
|
|
|
struct keylist parent_keys;
|
2013-07-25 09:04:18 +08:00
|
|
|
|
|
|
|
closure_init_stack(&cl);
|
2013-07-27 03:32:38 +08:00
|
|
|
bch_keylist_init(&parent_keys);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
if (btree_check_reserve(b, op)) {
|
|
|
|
if (!b->level)
|
|
|
|
return -EINTR;
|
|
|
|
else
|
|
|
|
WARN(1, "insufficient reserve for split\n");
|
|
|
|
}
|
2013-12-17 17:29:34 +08:00
|
|
|
|
2014-03-18 08:15:53 +08:00
|
|
|
n1 = btree_node_alloc_replacement(b, op);
|
2013-03-24 07:11:31 +08:00
|
|
|
if (IS_ERR(n1))
|
|
|
|
goto err;
|
|
|
|
|
2013-12-18 15:49:49 +08:00
|
|
|
split = set_blocks(btree_bset_first(n1),
|
|
|
|
block_bytes(n1->c)) > (btree_blocks(b) * 4) / 5;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
if (split) {
|
2018-08-11 13:19:44 +08:00
|
|
|
unsigned int keys = 0;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-12-18 15:49:49 +08:00
|
|
|
trace_bcache_btree_node_split(b, btree_bset_first(n1)->keys);
|
2013-04-27 06:39:55 +08:00
|
|
|
|
2014-07-12 15:22:53 +08:00
|
|
|
n2 = bch_btree_node_alloc(b->c, op, b->level, b->parent);
|
2013-03-24 07:11:31 +08:00
|
|
|
if (IS_ERR(n2))
|
|
|
|
goto err_free1;
|
|
|
|
|
2013-07-25 08:20:19 +08:00
|
|
|
if (!b->parent) {
|
2014-07-12 15:22:53 +08:00
|
|
|
n3 = bch_btree_node_alloc(b->c, op, b->level + 1, NULL);
|
2013-03-24 07:11:31 +08:00
|
|
|
if (IS_ERR(n3))
|
|
|
|
goto err_free2;
|
|
|
|
}
|
|
|
|
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_lock(&n1->write_lock);
|
|
|
|
mutex_lock(&n2->write_lock);
|
|
|
|
|
2013-09-11 09:52:54 +08:00
|
|
|
bch_btree_insert_keys(n1, op, insert_keys, replace_key);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 08:20:19 +08:00
|
|
|
/*
|
|
|
|
* Has to be a linear search because we don't have an auxiliary
|
2013-03-24 07:11:31 +08:00
|
|
|
* search tree yet
|
|
|
|
*/
|
|
|
|
|
2013-12-18 15:49:49 +08:00
|
|
|
while (keys < (btree_bset_first(n1)->keys * 3) / 5)
|
|
|
|
keys += bkey_u64s(bset_bkey_idx(btree_bset_first(n1),
|
2013-12-18 13:56:21 +08:00
|
|
|
keys));
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-12-18 13:56:21 +08:00
|
|
|
bkey_copy_key(&n1->key,
|
2013-12-18 15:49:49 +08:00
|
|
|
bset_bkey_idx(btree_bset_first(n1), keys));
|
|
|
|
keys += bkey_u64s(bset_bkey_idx(btree_bset_first(n1), keys));
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-12-18 15:49:49 +08:00
|
|
|
btree_bset_first(n2)->keys = btree_bset_first(n1)->keys - keys;
|
|
|
|
btree_bset_first(n1)->keys = keys;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-12-18 15:49:49 +08:00
|
|
|
memcpy(btree_bset_first(n2)->start,
|
|
|
|
bset_bkey_last(btree_bset_first(n1)),
|
|
|
|
btree_bset_first(n2)->keys * sizeof(uint64_t));
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
bkey_copy_key(&n2->key, &b->key);
|
|
|
|
|
2013-07-27 03:32:38 +08:00
|
|
|
bch_keylist_add(&parent_keys, &n2->key);
|
2013-07-25 09:04:18 +08:00
|
|
|
bch_btree_node_write(n2, &cl);
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_unlock(&n2->write_lock);
|
2013-03-24 07:11:31 +08:00
|
|
|
rw_unlock(true, n2);
|
2013-04-27 06:39:55 +08:00
|
|
|
} else {
|
2013-12-18 15:49:49 +08:00
|
|
|
trace_bcache_btree_node_compact(b, btree_bset_first(n1)->keys);
|
2013-04-27 06:39:55 +08:00
|
|
|
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_lock(&n1->write_lock);
|
2013-09-11 09:52:54 +08:00
|
|
|
bch_btree_insert_keys(n1, op, insert_keys, replace_key);
|
2013-04-27 06:39:55 +08:00
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-27 03:32:38 +08:00
|
|
|
bch_keylist_add(&parent_keys, &n1->key);
|
2013-07-25 09:04:18 +08:00
|
|
|
bch_btree_node_write(n1, &cl);
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_unlock(&n1->write_lock);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
if (n3) {
|
2013-07-25 08:20:19 +08:00
|
|
|
/* Depth increases, make a new root */
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_lock(&n3->write_lock);
|
2013-03-24 07:11:31 +08:00
|
|
|
bkey_copy_key(&n3->key, &MAX_KEY);
|
2013-07-27 03:32:38 +08:00
|
|
|
bch_btree_insert_keys(n3, op, &parent_keys, NULL);
|
2013-07-25 09:04:18 +08:00
|
|
|
bch_btree_node_write(n3, &cl);
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_unlock(&n3->write_lock);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 09:04:18 +08:00
|
|
|
closure_sync(&cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
bch_btree_set_root(n3);
|
|
|
|
rw_unlock(true, n3);
|
2013-07-25 08:20:19 +08:00
|
|
|
} else if (!b->parent) {
|
|
|
|
/* Root filled up but didn't need to be split */
|
2013-07-25 09:04:18 +08:00
|
|
|
closure_sync(&cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
bch_btree_set_root(n1);
|
|
|
|
} else {
|
2013-07-27 03:32:38 +08:00
|
|
|
/* Split a non root node */
|
2013-07-25 09:04:18 +08:00
|
|
|
closure_sync(&cl);
|
2013-07-27 03:32:38 +08:00
|
|
|
make_btree_freeing_key(b, parent_keys.top);
|
|
|
|
bch_keylist_push(&parent_keys);
|
|
|
|
|
|
|
|
bch_btree_insert_node(b->parent, op, &parent_keys, NULL, NULL);
|
|
|
|
BUG_ON(!bch_keylist_empty(&parent_keys));
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2014-03-18 09:22:34 +08:00
|
|
|
btree_node_free(b);
|
2013-03-24 07:11:31 +08:00
|
|
|
rw_unlock(true, n1);
|
|
|
|
|
2013-03-29 02:50:55 +08:00
|
|
|
bch_time_stats_update(&b->c->btree_split_time, start_time);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
err_free2:
|
2013-12-17 08:38:49 +08:00
|
|
|
bkey_put(b->c, &n2->key);
|
2013-07-25 08:27:07 +08:00
|
|
|
btree_node_free(n2);
|
2013-03-24 07:11:31 +08:00
|
|
|
rw_unlock(true, n2);
|
|
|
|
err_free1:
|
2013-12-17 08:38:49 +08:00
|
|
|
bkey_put(b->c, &n1->key);
|
2013-07-25 08:27:07 +08:00
|
|
|
btree_node_free(n1);
|
2013-03-24 07:11:31 +08:00
|
|
|
rw_unlock(true, n1);
|
|
|
|
err:
|
2014-03-18 08:15:53 +08:00
|
|
|
WARN(1, "bcache: btree split failed (level %u)", b->level);
|
2013-12-17 08:38:49 +08:00
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
if (n3 == ERR_PTR(-EAGAIN) ||
|
|
|
|
n2 == ERR_PTR(-EAGAIN) ||
|
|
|
|
n1 == ERR_PTR(-EAGAIN))
|
|
|
|
return -EAGAIN;
|
|
|
|
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
2013-09-11 09:41:15 +08:00
|
|
|
static int bch_btree_insert_node(struct btree *b, struct btree_op *op,
|
2013-07-25 08:44:17 +08:00
|
|
|
struct keylist *insert_keys,
|
2013-09-11 09:52:54 +08:00
|
|
|
atomic_t *journal_ref,
|
|
|
|
struct bkey *replace_key)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2014-03-05 08:42:42 +08:00
|
|
|
struct closure cl;
|
|
|
|
|
2013-07-27 03:32:38 +08:00
|
|
|
BUG_ON(b->level && replace_key);
|
|
|
|
|
2014-03-05 08:42:42 +08:00
|
|
|
closure_init_stack(&cl);
|
|
|
|
|
|
|
|
mutex_lock(&b->write_lock);
|
|
|
|
|
|
|
|
if (write_block(b) != btree_bset_last(b) &&
|
|
|
|
b->keys.last_set_unwritten)
|
|
|
|
bch_btree_init_next(b); /* just wrote a set */
|
|
|
|
|
2013-11-12 11:03:54 +08:00
|
|
|
if (bch_keylist_nkeys(insert_keys) > insert_u64s_remaining(b)) {
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_unlock(&b->write_lock);
|
|
|
|
goto split;
|
|
|
|
}
|
2013-12-07 19:57:58 +08:00
|
|
|
|
2014-03-05 08:42:42 +08:00
|
|
|
BUG_ON(write_block(b) != btree_bset_last(b));
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2014-03-05 08:42:42 +08:00
|
|
|
if (bch_btree_insert_keys(b, op, insert_keys, replace_key)) {
|
|
|
|
if (!b->level)
|
|
|
|
bch_btree_leaf_dirty(b, journal_ref);
|
|
|
|
else
|
|
|
|
bch_btree_node_write(b, &cl);
|
|
|
|
}
|
2013-07-27 03:32:38 +08:00
|
|
|
|
2014-03-05 08:42:42 +08:00
|
|
|
mutex_unlock(&b->write_lock);
|
|
|
|
|
|
|
|
/* wait for btree node write if necessary, after unlock */
|
|
|
|
closure_sync(&cl);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
split:
|
|
|
|
if (current->bio_list) {
|
|
|
|
op->lock = b->c->root->level + 1;
|
|
|
|
return -EAGAIN;
|
|
|
|
} else if (op->lock <= b->c->root->level) {
|
|
|
|
op->lock = b->c->root->level + 1;
|
|
|
|
return -EINTR;
|
|
|
|
} else {
|
|
|
|
/* Invalidated all iterators */
|
|
|
|
int ret = btree_split(b, op, insert_keys, replace_key);
|
|
|
|
|
|
|
|
if (bch_keylist_empty(insert_keys))
|
|
|
|
return 0;
|
|
|
|
else if (!ret)
|
|
|
|
return -EINTR;
|
|
|
|
return ret;
|
2013-07-27 03:32:38 +08:00
|
|
|
}
|
2013-09-11 09:41:15 +08:00
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 09:39:16 +08:00
|
|
|
int bch_btree_insert_check_key(struct btree *b, struct btree_op *op,
|
|
|
|
struct bkey *check_key)
|
|
|
|
{
|
|
|
|
int ret = -EINTR;
|
|
|
|
uint64_t btree_ptr = b->key.ptr[0];
|
|
|
|
unsigned long seq = b->seq;
|
|
|
|
struct keylist insert;
|
|
|
|
bool upgrade = op->lock == -1;
|
|
|
|
|
|
|
|
bch_keylist_init(&insert);
|
|
|
|
|
|
|
|
if (upgrade) {
|
|
|
|
rw_unlock(false, b);
|
|
|
|
rw_lock(true, b, b->level);
|
|
|
|
|
|
|
|
if (b->key.ptr[0] != btree_ptr ||
|
2018-08-11 13:19:50 +08:00
|
|
|
b->seq != seq + 1) {
|
2018-03-19 08:36:26 +08:00
|
|
|
op->lock = b->level;
|
2013-09-11 09:39:16 +08:00
|
|
|
goto out;
|
2018-08-11 13:19:50 +08:00
|
|
|
}
|
2013-09-11 09:39:16 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
SET_KEY_PTRS(check_key, 1);
|
|
|
|
get_random_bytes(&check_key->ptr[0], sizeof(uint64_t));
|
|
|
|
|
|
|
|
SET_PTR_DEV(check_key, 0, PTR_CHECK_DEV);
|
|
|
|
|
|
|
|
bch_keylist_add(&insert, check_key);
|
|
|
|
|
2013-09-11 09:52:54 +08:00
|
|
|
ret = bch_btree_insert_node(b, op, &insert, NULL, NULL);
|
2013-09-11 09:39:16 +08:00
|
|
|
|
|
|
|
BUG_ON(!ret && !bch_keylist_empty(&insert));
|
|
|
|
out:
|
|
|
|
if (upgrade)
|
|
|
|
downgrade_write(&b->lock);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-07-25 09:07:22 +08:00
|
|
|
struct btree_insert_op {
|
|
|
|
struct btree_op op;
|
|
|
|
struct keylist *keys;
|
|
|
|
atomic_t *journal_ref;
|
|
|
|
struct bkey *replace_key;
|
|
|
|
};
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-11-28 10:31:35 +08:00
|
|
|
static int btree_insert_fn(struct btree_op *b_op, struct btree *b)
|
2013-07-25 09:07:22 +08:00
|
|
|
{
|
|
|
|
struct btree_insert_op *op = container_of(b_op,
|
|
|
|
struct btree_insert_op, op);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 09:07:22 +08:00
|
|
|
int ret = bch_btree_insert_node(b, &op->op, op->keys,
|
|
|
|
op->journal_ref, op->replace_key);
|
|
|
|
if (ret && !bch_keylist_empty(op->keys))
|
|
|
|
return ret;
|
|
|
|
else
|
|
|
|
return MAP_DONE;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2013-07-25 09:07:22 +08:00
|
|
|
int bch_btree_insert(struct cache_set *c, struct keylist *keys,
|
|
|
|
atomic_t *journal_ref, struct bkey *replace_key)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
2013-07-25 09:07:22 +08:00
|
|
|
struct btree_insert_op op;
|
2013-03-24 07:11:31 +08:00
|
|
|
int ret = 0;
|
|
|
|
|
2013-07-25 09:07:22 +08:00
|
|
|
BUG_ON(current->bio_list);
|
2013-09-11 09:46:36 +08:00
|
|
|
BUG_ON(bch_keylist_empty(keys));
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 09:07:22 +08:00
|
|
|
bch_btree_op_init(&op.op, 0);
|
|
|
|
op.keys = keys;
|
|
|
|
op.journal_ref = journal_ref;
|
|
|
|
op.replace_key = replace_key;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 09:07:22 +08:00
|
|
|
while (!ret && !bch_keylist_empty(keys)) {
|
|
|
|
op.op.lock = 0;
|
|
|
|
ret = bch_btree_map_leaf_nodes(&op.op, c,
|
|
|
|
&START_KEY(keys->keys),
|
|
|
|
btree_insert_fn);
|
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 09:07:22 +08:00
|
|
|
if (ret) {
|
|
|
|
struct bkey *k;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 09:07:22 +08:00
|
|
|
pr_err("error %i", ret);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-07-25 09:07:22 +08:00
|
|
|
while ((k = bch_keylist_pop(keys)))
|
2013-07-25 07:46:42 +08:00
|
|
|
bkey_put(c, k);
|
2013-07-25 09:07:22 +08:00
|
|
|
} else if (op.op.insert_collision)
|
|
|
|
ret = -ESRCH;
|
2013-07-25 09:06:22 +08:00
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
void bch_btree_set_root(struct btree *b)
|
|
|
|
{
|
2018-08-11 13:19:44 +08:00
|
|
|
unsigned int i;
|
2013-06-27 08:25:38 +08:00
|
|
|
struct closure cl;
|
|
|
|
|
|
|
|
closure_init_stack(&cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-04-27 06:39:55 +08:00
|
|
|
trace_bcache_btree_set_root(b);
|
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
BUG_ON(!b->written);
|
|
|
|
|
|
|
|
for (i = 0; i < KEY_PTRS(&b->key); i++)
|
|
|
|
BUG_ON(PTR_BUCKET(b->c, &b->key, i)->prio != BTREE_PRIO);
|
|
|
|
|
|
|
|
mutex_lock(&b->c->bucket_lock);
|
|
|
|
list_del_init(&b->list);
|
|
|
|
mutex_unlock(&b->c->bucket_lock);
|
|
|
|
|
|
|
|
b->c->root = b;
|
|
|
|
|
2013-06-27 08:25:38 +08:00
|
|
|
bch_journal_meta(b->c, &cl);
|
|
|
|
closure_sync(&cl);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
2013-09-11 09:48:51 +08:00
|
|
|
/* Map across nodes or keys */
|
|
|
|
|
|
|
|
static int bch_btree_map_nodes_recurse(struct btree *b, struct btree_op *op,
|
|
|
|
struct bkey *from,
|
|
|
|
btree_map_nodes_fn *fn, int flags)
|
|
|
|
{
|
|
|
|
int ret = MAP_CONTINUE;
|
|
|
|
|
|
|
|
if (b->level) {
|
|
|
|
struct bkey *k;
|
|
|
|
struct btree_iter iter;
|
|
|
|
|
2013-11-12 09:35:24 +08:00
|
|
|
bch_btree_iter_init(&b->keys, &iter, from);
|
2013-09-11 09:48:51 +08:00
|
|
|
|
2013-12-21 09:28:16 +08:00
|
|
|
while ((k = bch_btree_iter_next_filter(&iter, &b->keys,
|
2013-09-11 09:48:51 +08:00
|
|
|
bch_ptr_bad))) {
|
|
|
|
ret = btree(map_nodes_recurse, k, b,
|
|
|
|
op, from, fn, flags);
|
|
|
|
from = NULL;
|
|
|
|
|
|
|
|
if (ret != MAP_CONTINUE)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!b->level || flags == MAP_ALL_NODES)
|
|
|
|
ret = fn(op, b);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
int __bch_btree_map_nodes(struct btree_op *op, struct cache_set *c,
|
|
|
|
struct bkey *from, btree_map_nodes_fn *fn, int flags)
|
|
|
|
{
|
2013-07-25 09:04:18 +08:00
|
|
|
return btree_root(map_nodes_recurse, c, op, from, fn, flags);
|
2013-09-11 09:48:51 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static int bch_btree_map_keys_recurse(struct btree *b, struct btree_op *op,
|
|
|
|
struct bkey *from, btree_map_keys_fn *fn,
|
|
|
|
int flags)
|
|
|
|
{
|
|
|
|
int ret = MAP_CONTINUE;
|
|
|
|
struct bkey *k;
|
|
|
|
struct btree_iter iter;
|
|
|
|
|
2013-11-12 09:35:24 +08:00
|
|
|
bch_btree_iter_init(&b->keys, &iter, from);
|
2013-09-11 09:48:51 +08:00
|
|
|
|
2013-12-21 09:28:16 +08:00
|
|
|
while ((k = bch_btree_iter_next_filter(&iter, &b->keys, bch_ptr_bad))) {
|
2013-09-11 09:48:51 +08:00
|
|
|
ret = !b->level
|
|
|
|
? fn(op, b, k)
|
|
|
|
: btree(map_keys_recurse, k, b, op, from, fn, flags);
|
|
|
|
from = NULL;
|
|
|
|
|
|
|
|
if (ret != MAP_CONTINUE)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!b->level && (flags & MAP_END_KEY))
|
|
|
|
ret = fn(op, b, &KEY(KEY_INODE(&b->key),
|
|
|
|
KEY_OFFSET(&b->key), 0));
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
int bch_btree_map_keys(struct btree_op *op, struct cache_set *c,
|
|
|
|
struct bkey *from, btree_map_keys_fn *fn, int flags)
|
|
|
|
{
|
2013-07-25 09:04:18 +08:00
|
|
|
return btree_root(map_keys_recurse, c, op, from, fn, flags);
|
2013-09-11 09:48:51 +08:00
|
|
|
}
|
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
/* Keybuf code */
|
|
|
|
|
|
|
|
static inline int keybuf_cmp(struct keybuf_key *l, struct keybuf_key *r)
|
|
|
|
{
|
|
|
|
/* Overlapping keys compare equal */
|
|
|
|
if (bkey_cmp(&l->key, &START_KEY(&r->key)) <= 0)
|
|
|
|
return -1;
|
|
|
|
if (bkey_cmp(&START_KEY(&l->key), &r->key) >= 0)
|
|
|
|
return 1;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int keybuf_nonoverlapping_cmp(struct keybuf_key *l,
|
|
|
|
struct keybuf_key *r)
|
|
|
|
{
|
|
|
|
return clamp_t(int64_t, bkey_cmp(&l->key, &r->key), -1, 1);
|
|
|
|
}
|
|
|
|
|
2013-09-11 09:48:51 +08:00
|
|
|
struct refill {
|
|
|
|
struct btree_op op;
|
2018-08-11 13:19:44 +08:00
|
|
|
unsigned int nr_found;
|
2013-09-11 09:48:51 +08:00
|
|
|
struct keybuf *buf;
|
|
|
|
struct bkey *end;
|
|
|
|
keybuf_pred_fn *pred;
|
|
|
|
};
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 09:48:51 +08:00
|
|
|
static int refill_keybuf_fn(struct btree_op *op, struct btree *b,
|
|
|
|
struct bkey *k)
|
|
|
|
{
|
|
|
|
struct refill *refill = container_of(op, struct refill, op);
|
|
|
|
struct keybuf *buf = refill->buf;
|
|
|
|
int ret = MAP_CONTINUE;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
bcache: fix miss key refill->end in writeback
refill->end records the last key of writeback. For example, the first
time, keys (1,128K) to (1,1024K) are flushed to the backend device, but
the end key (1,1024K) is not included, because of the code below:
if (bkey_cmp(k, refill->end) >= 0) {
ret = MAP_DONE;
goto out;
}
The next time we refill the writeback keybuf, the search starts
from (1,1024K) and returns a key bigger than it, so the key
(1,1024K) is missed.
This patch modifies the above code so that the end key is included in
the writeback key buffer.
Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
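The boundary condition described in this commit message can be illustrated outside the kernel. The following is a minimal sketch, not bcache code: `struct toy_key`, `toy_key_cmp()` and `count_scanned()` are hypothetical stand-ins that order keys by (inode, offset) only, whereas real bkeys also carry a size and pointers. It shows only how stopping on `>= 0` excludes the end key while stopping on `> 0` includes it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a bcache key, ordered by (inode, offset). */
struct toy_key {
	unsigned int inode;
	unsigned long long offset;
};

/* Returns <0, 0 or >0, in the spirit of bkey_cmp(); illustration only. */
int toy_key_cmp(const struct toy_key *l, const struct toy_key *r)
{
	if (l->inode != r->inode)
		return l->inode < r->inode ? -1 : 1;
	if (l->offset != r->offset)
		return l->offset < r->offset ? -1 : 1;
	return 0;
}

/*
 * Walk sorted keys and count how many are accepted before the scan stops.
 * inclusive == 0 mimics the old check (stop when cmp >= 0),
 * inclusive != 0 mimics the patched check (stop when cmp > 0).
 */
size_t count_scanned(const struct toy_key *keys, size_t n,
		     const struct toy_key *end, int inclusive)
{
	size_t found = 0;
	size_t i;

	assert(keys != NULL && end != NULL);

	for (i = 0; i < n; i++) {
		int c = toy_key_cmp(&keys[i], end);

		if (inclusive ? c > 0 : c >= 0)
			break;
		found++;
	}
	return found;
}

/* Sample data: offsets given in bytes (128K, 512K, 1024K, 2048K). */
const struct toy_key demo_keys[] = {
	{ 1, 128ULL << 10 },
	{ 1, 512ULL << 10 },
	{ 1, 1024ULL << 10 },
	{ 1, 2048ULL << 10 },
};
const struct toy_key demo_end = { 1, 1024ULL << 10 };
```

With the sample data, `count_scanned(demo_keys, 4, &demo_end, 0)` accepts two keys and stops at (1,1024K), while `count_scanned(demo_keys, 4, &demo_end, 1)` accepts three, so the end key is no longer skipped.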
2018-10-08 20:41:14 +08:00
|
|
|
if (bkey_cmp(k, refill->end) > 0) {
|
2013-09-11 09:48:51 +08:00
|
|
|
ret = MAP_DONE;
|
|
|
|
goto out;
|
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 09:48:51 +08:00
|
|
|
if (!KEY_SIZE(k)) /* end key */
|
|
|
|
goto out;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 09:48:51 +08:00
|
|
|
if (refill->pred(buf, k)) {
|
|
|
|
struct keybuf_key *w;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 09:48:51 +08:00
|
|
|
spin_lock(&buf->lock);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 09:48:51 +08:00
|
|
|
w = array_alloc(&buf->freelist);
|
|
|
|
if (!w) {
|
|
|
|
spin_unlock(&buf->lock);
|
|
|
|
return MAP_DONE;
|
|
|
|
}
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 09:48:51 +08:00
|
|
|
w->private = NULL;
|
|
|
|
bkey_copy(&w->key, k);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 09:48:51 +08:00
|
|
|
if (RB_INSERT(&buf->keys, w, node, keybuf_cmp))
|
|
|
|
array_free(&buf->freelist, w);
|
2013-11-01 06:43:22 +08:00
|
|
|
else
|
|
|
|
refill->nr_found++;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 09:48:51 +08:00
|
|
|
if (array_freelist_empty(&buf->freelist))
|
|
|
|
ret = MAP_DONE;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-09-11 09:48:51 +08:00
|
|
|
spin_unlock(&buf->lock);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
2013-09-11 09:48:51 +08:00
|
|
|
out:
|
|
|
|
buf->last_scanned = *k;
|
|
|
|
return ret;
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
void bch_refill_keybuf(struct cache_set *c, struct keybuf *buf,
|
2013-06-05 21:24:39 +08:00
|
|
|
struct bkey *end, keybuf_pred_fn *pred)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
struct bkey start = buf->last_scanned;
|
2013-09-11 09:48:51 +08:00
|
|
|
struct refill refill;
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
cond_resched();
|
|
|
|
|
2013-07-25 09:04:18 +08:00
|
|
|
bch_btree_op_init(&refill.op, -1);
|
2013-11-01 06:43:22 +08:00
|
|
|
refill.nr_found = 0;
|
|
|
|
refill.buf = buf;
|
|
|
|
refill.end = end;
|
|
|
|
refill.pred = pred;
|
2013-09-11 09:48:51 +08:00
|
|
|
|
|
|
|
bch_btree_map_keys(&refill.op, c, &buf->last_scanned,
|
|
|
|
refill_keybuf_fn, MAP_END_KEY);
|
2013-03-24 07:11:31 +08:00
|
|
|
|
2013-11-01 06:43:22 +08:00
|
|
|
trace_bcache_keyscan(refill.nr_found,
|
|
|
|
KEY_INODE(&start), KEY_OFFSET(&start),
|
|
|
|
KEY_INODE(&buf->last_scanned),
|
|
|
|
KEY_OFFSET(&buf->last_scanned));
|
2013-03-24 07:11:31 +08:00
|
|
|
|
|
|
|
spin_lock(&buf->lock);
|
|
|
|
|
|
|
|
if (!RB_EMPTY_ROOT(&buf->keys)) {
|
|
|
|
struct keybuf_key *w;
|
2018-08-11 13:19:45 +08:00
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
w = RB_FIRST(&buf->keys, struct keybuf_key, node);
|
|
|
|
buf->start = START_KEY(&w->key);
|
|
|
|
|
|
|
|
w = RB_LAST(&buf->keys, struct keybuf_key, node);
|
|
|
|
buf->end = w->key;
|
|
|
|
} else {
|
|
|
|
buf->start = MAX_KEY;
|
|
|
|
buf->end = MAX_KEY;
|
|
|
|
}
|
|
|
|
|
|
|
|
spin_unlock(&buf->lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void __bch_keybuf_del(struct keybuf *buf, struct keybuf_key *w)
|
|
|
|
{
|
|
|
|
rb_erase(&w->node, &buf->keys);
|
|
|
|
array_free(&buf->freelist, w);
|
|
|
|
}
|
|
|
|
|
|
|
|
void bch_keybuf_del(struct keybuf *buf, struct keybuf_key *w)
|
|
|
|
{
|
|
|
|
spin_lock(&buf->lock);
|
|
|
|
__bch_keybuf_del(buf, w);
|
|
|
|
spin_unlock(&buf->lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
bool bch_keybuf_check_overlapping(struct keybuf *buf, struct bkey *start,
|
|
|
|
struct bkey *end)
|
|
|
|
{
|
|
|
|
bool ret = false;
|
|
|
|
struct keybuf_key *p, *w, s;
|
2018-08-11 13:19:45 +08:00
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
s.key = *start;
|
|
|
|
|
|
|
|
if (bkey_cmp(end, &buf->start) <= 0 ||
|
|
|
|
bkey_cmp(start, &buf->end) >= 0)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
spin_lock(&buf->lock);
|
|
|
|
w = RB_GREATER(&buf->keys, s, node, keybuf_nonoverlapping_cmp);
|
|
|
|
|
|
|
|
while (w && bkey_cmp(&START_KEY(&w->key), end) < 0) {
|
|
|
|
p = w;
|
|
|
|
w = RB_NEXT(w, node);
|
|
|
|
|
|
|
|
if (p->private)
|
|
|
|
ret = true;
|
|
|
|
else
|
|
|
|
__bch_keybuf_del(buf, p);
|
|
|
|
}
|
|
|
|
|
|
|
|
spin_unlock(&buf->lock);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
struct keybuf_key *bch_keybuf_next(struct keybuf *buf)
|
|
|
|
{
|
|
|
|
struct keybuf_key *w;
|
2018-08-11 13:19:45 +08:00
|
|
|
|
2013-03-24 07:11:31 +08:00
|
|
|
spin_lock(&buf->lock);
|
|
|
|
|
|
|
|
w = RB_FIRST(&buf->keys, struct keybuf_key, node);
|
|
|
|
|
|
|
|
while (w && w->private)
|
|
|
|
w = RB_NEXT(w, node);
|
|
|
|
|
|
|
|
if (w)
|
|
|
|
w->private = ERR_PTR(-EINTR);
|
|
|
|
|
|
|
|
spin_unlock(&buf->lock);
|
|
|
|
return w;
|
|
|
|
}
|
|
|
|
|
|
|
|
struct keybuf_key *bch_keybuf_next_rescan(struct cache_set *c,
|
2013-09-11 09:48:51 +08:00
|
|
|
struct keybuf *buf,
|
|
|
|
struct bkey *end,
|
|
|
|
keybuf_pred_fn *pred)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
struct keybuf_key *ret;
|
|
|
|
|
|
|
|
while (1) {
|
|
|
|
ret = bch_keybuf_next(buf);
|
|
|
|
if (ret)
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (bkey_cmp(&buf->last_scanned, end) >= 0) {
|
|
|
|
pr_debug("scan finished");
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2013-06-05 21:24:39 +08:00
|
|
|
bch_refill_keybuf(c, buf, end, pred);
|
2013-03-24 07:11:31 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-06-05 21:24:39 +08:00
|
|
|
void bch_keybuf_init(struct keybuf *buf)
|
2013-03-24 07:11:31 +08:00
|
|
|
{
|
|
|
|
buf->last_scanned = MAX_KEY;
|
|
|
|
buf->keys = RB_ROOT;
|
|
|
|
|
|
|
|
spin_lock_init(&buf->lock);
|
|
|
|
array_allocator_init(&buf->freelist);
|
|
|
|
}
|