License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
  SPDX license identifier                            # files
  ---------------------------------------------------|-------
  GPL-2.0                                              11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that were:
  SPDX license identifier                            # files
  ---------------------------------------------------|-------
  GPL-2.0 WITH Linux-syscall-note                        930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
  SPDX license identifier                            # files
  ---------------------------------------------------|-------
  GPL-2.0 WITH Linux-syscall-note                        270
  GPL-2.0+ WITH Linux-syscall-note                       169
  ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)     21
  ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)     17
  LGPL-2.1+ WITH Linux-syscall-note                       15
  GPL-1.0+ WITH Linux-syscall-note                        14
  ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)     5
  LGPL-2.0+ WITH Linux-syscall-note                        4
  LGPL-2.1 WITH Linux-syscall-note                         3
  ((GPL-2.0 WITH Linux-syscall-note) OR MIT)               3
  ((GPL-2.0 WITH Linux-syscall-note) AND MIT)              1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were any new insights.
The Windriver scanner is based on an older version of FOSSology in part,
so they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
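The tagging rule for files without any license information, together with the header-versus-source comment distinction mentioned above, can be sketched roughly as follows. This is only an illustration, not the script Thomas and Greg used: the helper names are hypothetical, the "/uapi/" substring test is a simplification, and the `//` style for non-header files follows the kernel's documented convention for C sources rather than anything stated in this message.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: treat any path containing "/uapi/" as a uapi file. */
static int is_uapi_path(const char *path)
{
	return strstr(path, "/uapi/") != NULL;
}

/* Identifier for a file in which neither scanner found a license. */
static const char *default_spdx_id(const char *path)
{
	return is_uapi_path(path) ? "GPL-2.0 WITH Linux-syscall-note"
				  : "GPL-2.0";
}

/*
 * Write the tag line into buf using the comment style the file type
 * expects: a block comment for .h files, a line comment otherwise.
 * Simplification: every non-.h path is treated like a .c file.
 */
static int format_spdx_line(const char *path, char *buf, size_t len)
{
	const char *ext = strrchr(path, '.');
	const char *id = default_spdx_id(path);

	if (ext && strcmp(ext, ".h") == 0)
		return snprintf(buf, len,
				"/* SPDX-License-Identifier: %s */", id);
	return snprintf(buf, len, "// SPDX-License-Identifier: %s", id);
}
```

For the header shown further down, `format_spdx_line("drivers/md/bcache/btree.h", ...)` yields the same first line that this commit added to it.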
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHE_BTREE_H
#define _BCACHE_BTREE_H

/*
 * THE BTREE:
 *
 * At a high level, bcache's btree is a relatively standard b+ tree. All keys
 * and pointers are in the leaves; interior nodes only have pointers to the
 * child nodes.
 *
 * In the interior nodes, a struct bkey always points to a child btree node, and
 * the key is the highest key in the child node - except that the highest key in
 * an interior node is always MAX_KEY. The size field refers to the size on disk
 * of the child node - this would allow us to have variable sized btree nodes
 * (handy for keeping the depth of the btree 1 by expanding just the root).
 *
 * Btree nodes are themselves log structured, but this is hidden fairly
 * thoroughly. Btree nodes on disk will in practice have extents that overlap
 * (because they were written at different times), but in memory we never have
 * overlapping extents - when we read in a btree node from disk, the first thing
 * we do is resort all the sets of keys with a mergesort, and in the same pass
 * we check for overlapping extents and adjust them appropriately.
 *
 * struct btree_op is a central interface to the btree code. It's used for
 * specifying read vs. write locking, and the embedded closure is used for
 * waiting on IO or reserve memory.
 *
 * BTREE CACHE:
 *
 * Btree nodes are cached in memory; traversing the btree might require reading
 * in btree nodes which is handled mostly transparently.
 *
 * bch_btree_node_get() looks up a btree node in the cache and reads it in from
 * disk if necessary. This function is almost never called directly though - the
 * bcache_btree() macro is used to get a btree node, call some function on it,
 * and unlock the node after the function returns.
 *
 * The root is special cased - it's taken out of the cache's lru (thus pinning
 * it in memory), so we can find the root of the btree by just dereferencing a
 * pointer instead of looking it up in the cache. This makes locking a bit
 * tricky, since the root pointer is protected by the lock in the btree node it
 * points to - the bcache_btree_root() macro handles this.
 *
 * In various places we must be able to allocate memory for multiple btree nodes
 * in order to make forward progress. To do this we use the btree cache itself
 * as a reserve; if __get_free_pages() fails, we'll find a node in the btree
 * cache we can reuse. We can't allow more than one thread to be doing this at a
 * time, so there's a lock, implemented by a pointer to the btree_op closure -
 * this allows the bcache_btree_root() macro to implicitly release this lock.
 *
 * BTREE IO:
 *
 * Btree nodes never have to be explicitly read in; bch_btree_node_get() handles
 * this.
 *
 * For writing, we have two btree_write structs embedded in struct btree - one
 * write in flight, and one being set up, and we toggle between them.
 *
 * Writing is done with a single function - bch_btree_write() really serves two
 * different purposes and should be broken up into two different functions. When
 * passing now = false, it merely indicates that the node is now dirty - calling
 * it ensures that the dirty keys will be written at some point in the future.
 *
 * When passing now = true, bch_btree_write() causes a write to happen
 * "immediately" (if there was already a write in flight, it'll cause the write
 * to happen as soon as the previous write completes). It returns immediately
 * though - but it takes a refcount on the closure in struct btree_op you passed
 * to it, so a closure_sync() later can be used to wait for the write to
 * complete.
 *
 * This is handy because btree_split() and garbage collection can issue writes
 * in parallel, reducing the amount of time they have to hold write locks.
 *
 * LOCKING:
 *
 * When traversing the btree, we may need write locks starting at some level -
 * inserting a key into the btree will typically only require a write lock on
 * the leaf node.
 *
 * This is specified with the lock field in struct btree_op; lock = 0 means we
 * take write locks at level <= 0, i.e. only leaf nodes. bch_btree_node_get()
 * checks this field and returns the node with the appropriate lock held.
 *
 * If, after traversing the btree, the insertion code discovers it has to split
 * then it must restart from the root and take new locks - to do this it changes
 * the lock field and returns -EINTR, which causes the bcache_btree_root() macro
 * to loop.
 *
 * Handling cache misses requires a different mechanism for upgrading to a write
 * lock. We do cache lookups with only a read lock held, but if we get a cache
 * miss and we wish to insert this data into the cache, we have to insert a
 * placeholder key to detect races - otherwise, we could race with a write and
 * overwrite the data that was just written to the cache with stale data from
 * the backing device.
 *
 * For this we use a sequence number that write locks and unlocks increment - to
 * insert the check key it unlocks the btree node and then takes a write lock,
 * and fails if the sequence number doesn't match.
 */

#include "bset.h"
#include "debug.h"

struct btree_write {
	atomic_t		*journal;

	/* If btree_split() frees a btree node, it writes a new pointer to that
	 * btree node indicating it was freed; it takes a refcount on
	 * c->prio_blocked because we can't write the gens until the new
	 * pointer is on disk. This allows btree_write_endio() to release the
	 * refcount that btree_split() took.
	 */
	int			prio_blocked;
};

struct btree {
	/* Hottest entries first */
	struct hlist_node	hash;

	/* Key/pointer for this btree node */
	BKEY_PADDED(key);

	unsigned long		seq;
	struct rw_semaphore	lock;
	struct cache_set	*c;
	struct btree		*parent;

	struct mutex		write_lock;

	unsigned long		flags;
	uint16_t		written;	/* would be nice to kill */
	uint8_t			level;

	struct btree_keys	keys;

	/* For outstanding btree writes, used as a lock - protects write_idx */
	struct closure		io;
	struct semaphore	io_mutex;

	struct list_head	list;
	struct delayed_work	work;

	struct btree_write	writes[2];
	struct bio		*bio;
};

bcache: make bch_btree_check() to be multithreaded
When registering a cache device, bch_btree_check() is called to check
all btree nodes, to make sure the btree is consistent and not
corrupted.
bch_btree_check() is executed recursively in a single thread; when a lot
of data is cached and the btree is huge, it may take a very long time to
check all the btree nodes. In my testing, I observed it took around 50
minutes to finish bch_btree_check().
When checking the bcache btree nodes, the cache set is not running yet
and the whole tree is in a read-only state, so it is safe to create
multiple threads to check the btree in parallel.
This patch creates multiple threads, and each thread checks, one by one,
the sub-trees indexed by keys from the btree root node.
The number of parallel threads depends on how many keys are in the btree
root node. At most BCH_BTR_CHKTHREAD_MAX (64) threads can be created, but
in practice it should be min(cpu-number/2, root-node-keys-number).
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-22 14:03:01 +08:00
#define BTREE_FLAG(flag)						\
static inline bool btree_node_ ## flag(struct btree *b)			\
{	return test_bit(BTREE_NODE_ ## flag, &b->flags); }		\
									\
static inline void set_btree_node_ ## flag(struct btree *b)		\
{	set_bit(BTREE_NODE_ ## flag, &b->flags); }

enum btree_flags {
	BTREE_NODE_io_error,
	BTREE_NODE_dirty,
	BTREE_NODE_write_idx,
bcache: fix race in btree_flush_write()
There is a race between mca_reap(), btree_node_free() and the journal code
btree_flush_write(), which results in very rare and strange deadlocks or
panics that are very hard to reproduce.
Let me explain how the race happens. In btree_flush_write() one btree
node with the oldest journal pin is selected, then it is flushed to the
cache device; the select-and-flush is a two-step operation. Between these
two steps, several things may happen inside the race window:
- The selected btree node was reaped by mca_reap() and allocated to
other requesters for other btree node.
- The selected btree node was selected, flushed and released by mca
shrink callback bch_mca_scan().
When btree_flush_write() tries to flush the selected btree node, firstly
b->write_lock is held by mutex_lock(). If the race happens and the
memory of the selected btree node is allocated to another btree node, and
that btree node's write_lock is held already, a deadlock very probably
happens here. A worse case is that the memory of the selected btree node is
released; then all references to this btree node (e.g. b->write_lock)
will trigger a NULL pointer dereference panic.
This race was introduced in commit cafe56359144 ("bcache: A block layer
cache"), and enlarged by commit c4dc2497d50d ("bcache: fix high CPU
occupancy during journal"), which selected 128 btree nodes and flushed
them one-by-one in a quite long time period.
Such a race was not easy to reproduce before. On a Lenovo SR650 server with
48 Xeon cores, configured with 1 NVMe SSD as cache device and an MD raid0
device assembled from 3 NVMe SSDs as backing device, this race can be
observed around once every 10,000 times btree_flush_write() gets called.
Both deadlock and kernel panic happened as an aftermath of the race.
The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
is set when selecting btree nodes, and cleared after the btree nodes are
flushed. Then when mca_reap() selects a btree node with this bit set,
this btree node will be skipped. Since mca_reap() only reaps btree nodes
without the BTREE_NODE_journal_flush flag, such a race is avoided.
One corner case should be noticed, that is btree_node_free(). It might
be called in some error handling code path. For example the following
code piece from btree_split(),
2149 err_free2:
2150 bkey_put(b->c, &n2->key);
2151 btree_node_free(n2);
2152 rw_unlock(true, n2);
2153 err_free1:
2154 bkey_put(b->c, &n1->key);
2155 btree_node_free(n1);
2156 rw_unlock(true, n1);
At lines 2151 and 2155, the btree nodes n2 and n1 are released without
mca_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
If btree_node_free() is called directly in such an error handling path,
and the selected btree node has the BTREE_NODE_journal_flush bit set, just
delay for 1 us and retry again. In this case this btree node won't
be skipped, just retry until the BTREE_NODE_journal_flush bit is cleared,
and free the btree node memory.
Fixes: cafe56359144 ("bcache: A block layer cache")
Signed-off-by: Coly Li <colyli@suse.de>
Reported-and-tested-by: kbuild test robot <lkp@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 19:59:58 +08:00
	BTREE_NODE_journal_flush,
};

BTREE_FLAG(io_error);
BTREE_FLAG(dirty);
BTREE_FLAG(write_idx);
BTREE_FLAG(journal_flush);
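As a rough userspace illustration of what each BTREE_FLAG() line generates, the sketch below uses the same token-pasting pattern on a stand-in type. The names struct node, NODE_FLAG and NODE_* are hypothetical, and plain shifts replace the kernel's atomic test_bit()/set_bit(), so this version is not safe for concurrent use.

```c
#include <stdbool.h>

/* Userspace stand-in for struct btree: only the flags word is modelled. */
struct node {
	unsigned long flags;
};

/* Bit numbers, named so that NODE_ ## flag resolves to them. */
enum node_flags {
	NODE_io_error,
	NODE_dirty,
	NODE_write_idx,
};

/* Non-atomic stand-ins for test_bit() and set_bit(). */
#define NODE_FLAG(flag)							\
static inline bool node_ ## flag(const struct node *n)			\
{	return (n->flags >> NODE_ ## flag) & 1UL; }			\
									\
static inline void set_node_ ## flag(struct node *n)			\
{	n->flags |= 1UL << NODE_ ## flag; }

/* Each line expands to one tester and one setter. */
NODE_FLAG(io_error)
NODE_FLAG(dirty)
NODE_FLAG(write_idx)
```

After these expansions, node_dirty() and set_node_dirty() exist as ordinary inline functions, which is how btree_node_write_idx() comes to be available to the helpers that follow in the header.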

static inline struct btree_write *btree_current_write(struct btree *b)
{
	return b->writes + btree_node_write_idx(b);
}

static inline struct btree_write *btree_prev_write(struct btree *b)
{
	return b->writes + (btree_node_write_idx(b) ^ 1);
}
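The two helpers above select between the two btree_write slots by XOR-ing a one-bit index. A minimal standalone sketch of that toggle follows; struct wnode, struct write_slot and flip_write_idx() are hypothetical, and the point at which the kernel actually flips the bit lives in btree.c and is not shown here.

```c
/* Stand-in for struct btree_write. */
struct write_slot {
	int prio_blocked;
};

/* Stand-in for struct btree: two slots plus a one-bit index. */
struct wnode {
	struct write_slot writes[2];
	unsigned int write_idx;		/* always 0 or 1 */
};

/* Slot currently being set up. */
static struct write_slot *current_write(struct wnode *n)
{
	return n->writes + n->write_idx;
}

/* The other slot, i.e. the write that may still be in flight. */
static struct write_slot *prev_write(struct wnode *n)
{
	return n->writes + (n->write_idx ^ 1);
}

/* Assumed to happen when a write is issued: swap the roles of the slots. */
static void flip_write_idx(struct wnode *n)
{
	n->write_idx ^= 1;
}
```

Because the index is a single bit, current and previous never alias, and flipping twice returns to the original slot.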

static inline struct bset *btree_bset_first(struct btree *b)
{
	return b->keys.set->data;
}

static inline struct bset *btree_bset_last(struct btree *b)
{
	return bset_tree_last(&b->keys)->data;
}

static inline unsigned int bset_block_offset(struct btree *b, struct bset *i)
{
	return bset_sector_offset(&b->keys, i) >> b->c->block_bits;
}

static inline void set_gc_sectors(struct cache_set *c)
{
	atomic_set(&c->sectors_to_gc, c->cache->sb.bucket_size * c->nbuckets / 16);
}
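To make the arithmetic in set_gc_sectors() concrete, the sketch below computes the same expression for made-up inputs. It assumes bucket_size is expressed in sectors; the function name and the example figures are hypothetical, and the code that later decrements the counter is outside this header.

```c
#include <stdint.h>

/*
 * Same expression as set_gc_sectors(): one sixteenth of the total
 * capacity, with capacity = bucket size (assumed in sectors) * buckets.
 * 64-bit arithmetic is used here only to keep the sketch overflow-free.
 */
static int64_t gc_sector_budget(uint64_t bucket_size_sectors,
				uint64_t nbuckets)
{
	return (int64_t)(bucket_size_sectors * nbuckets / 16);
}
```

For example, 16384 buckets of 1024 sectors each give 1024 * 16384 / 16 = 1048576 sectors, which is 512 MiB with 512-byte sectors.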

void bkey_put(struct cache_set *c, struct bkey *k);

/* Looping macros */

#define for_each_cached_btree(b, c, iter)				\
	for (iter = 0;							\
	     iter < ARRAY_SIZE((c)->bucket_hash);			\
	     iter++)							\
		hlist_for_each_entry_rcu((b), (c)->bucket_hash + iter, hash)

/* Recursing down the btree */

struct btree_op {
	/* for waiting on btree reserve in btree_split() */
	wait_queue_entry_t		wait;

	/* Btree level at which we start taking write locks */
	short			lock;

	unsigned int		insert_collision:1;
};

struct btree_check_state;
struct btree_check_info {
	struct btree_check_state	*state;
	struct task_struct		*thread;
	int				result;
};

#define BCH_BTR_CHKTHREAD_MAX	64
struct btree_check_state {
	struct cache_set		*c;
	int				total_threads;
	int				key_idx;
	spinlock_t			idx_lock;
	atomic_t			started;
	atomic_t			enough;
	wait_queue_head_t		wait;
	struct btree_check_info		infos[BCH_BTR_CHKTHREAD_MAX];
};
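The commit message for the multithreaded bch_btree_check() gives the intended thread count as min(cpu-number/2, root-node-keys-number), capped at BCH_BTR_CHKTHREAD_MAX. A small sketch of that sizing rule is shown below; the function name is hypothetical, and the lower bound of one thread is an assumption added so the sketch never returns zero.

```c
#define CHKTHREAD_MAX	64	/* mirrors BCH_BTR_CHKTHREAD_MAX */

/*
 * Number of checker threads for a given CPU count and number of keys
 * in the btree root node: min(cpus / 2, root_keys), clamped to
 * [1, CHKTHREAD_MAX]. The floor of 1 is an assumption of this sketch.
 */
static int check_thread_count(int online_cpus, int root_keys)
{
	int n = online_cpus / 2;

	if (n > root_keys)
		n = root_keys;
	if (n > CHKTHREAD_MAX)
		n = CHKTHREAD_MAX;
	if (n < 1)
		n = 1;
	return n;
}
```

On the 48-core machine mentioned in the race-fix commit message, a root node with 100 keys would give 24 threads under this rule.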

static inline void bch_btree_op_init(struct btree_op *op, int write_lock_level)
{
	memset(op, 0, sizeof(struct btree_op));
	init_wait(&op->wait);
	op->lock = write_lock_level;
}

static inline void rw_lock(bool w, struct btree *b, int level)
{
	w ? down_write_nested(&b->lock, level + 1)
	  : down_read_nested(&b->lock, level + 1);
	if (w)
		b->seq++;
}

static inline void rw_unlock(bool w, struct btree *b)
{
	if (w)
		b->seq++;
	(w ? up_write : up_read)(&b->lock);
}
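Since both rw_lock() and rw_unlock() bump b->seq for write locks, a caller that drops a read lock and then takes a write lock can tell whether another writer got in between. The sketch below models only the counter, not the rw_semaphore, and all names in it are hypothetical; it is meant as an illustration of the check described in the LOCKING comment, not as the kernel's implementation.

```c
#include <stdbool.h>

/* Stand-in for struct btree: only the sequence number is modelled. */
struct seq_node {
	unsigned long seq;
};

/* Write lock and unlock each increment the sequence number. */
static void write_lock(struct seq_node *n)
{
	n->seq++;
}

static void write_unlock(struct seq_node *n)
{
	n->seq++;
}

/*
 * saved_seq was sampled while the read lock was still held. After
 * taking the write lock, exactly one increment (our own) must have
 * happened; anything else means another writer held the node in
 * between. On a detected race the lock is dropped again.
 */
static bool upgrade_to_write(struct seq_node *n, unsigned long saved_seq)
{
	write_lock(n);
	if (n->seq != saved_seq + 1) {
		write_unlock(n);
		return false;
	}
	return true;
}
```

A failed upgrade leaves the caller without any lock, so it has to restart its lookup rather than insert a placeholder key over data it can no longer trust.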

void bch_btree_node_read_done(struct btree *b);
void __bch_btree_node_write(struct btree *b, struct closure *parent);
void bch_btree_node_write(struct btree *b, struct closure *parent);

void bch_btree_set_root(struct btree *b);
struct btree *__bch_btree_node_alloc(struct cache_set *c, struct btree_op *op,
				     int level, bool wait,
				     struct btree *parent);
struct btree *bch_btree_node_get(struct cache_set *c, struct btree_op *op,
				 struct bkey *k, int level, bool write,
				 struct btree *parent);

int bch_btree_insert_check_key(struct btree *b, struct btree_op *op,
			       struct bkey *check_key);
int bch_btree_insert(struct cache_set *c, struct keylist *keys,
		     atomic_t *journal_ref, struct bkey *replace_key);

int bch_gc_thread_start(struct cache_set *c);
void bch_initial_gc_finish(struct cache_set *c);
void bch_moving_gc(struct cache_set *c);
int bch_btree_check(struct cache_set *c);
void bch_initial_mark_key(struct cache_set *c, int level, struct bkey *k);

static inline void wake_up_gc(struct cache_set *c)
{
	wake_up(&c->gc_wait);
}

static inline void force_wake_up_gc(struct cache_set *c)
{
	/*
	 * Garbage collection thread only works when sectors_to_gc < 0,
	 * calling wake_up_gc() won't start gc thread if sectors_to_gc is
	 * not a negative value.
	 * Therefore sectors_to_gc is set to -1 here, before waking up
	 * gc thread by calling wake_up_gc(). Then gc_should_run() will
	 * give a chance to permit gc thread to run. "Give a chance" means
	 * before going into gc_should_run(), there is still a possibility
	 * that c->sectors_to_gc is set to some other positive value. So
	 * this routine cannot fully guarantee that the gc thread will be
	 * woken up to run.
	 */
	atomic_set(&c->sectors_to_gc, -1);
	wake_up_gc(c);
}

/*
 * These macros are for recursing down the btree - they handle the details of
 * locking and looking up nodes in the cache for you. They're best treated as
 * mere syntax when reading code that uses them.
 *
 * op->lock determines whether we take a read or a write lock at a given depth.
 * If you've got a read lock and find that you need a write lock (i.e. you're
 * going to have to split), set op->lock and return -EINTR; bcache_btree_root()
 * will call you again and you'll have the correct lock.
 */

/**
 * bcache_btree - recurse down the btree on a specified key
 * @fn:		function to call, which will be passed the child node
 * @key:	key to recurse on
 * @b:		parent btree node
 * @op:		pointer to struct btree_op
 */
#define bcache_btree(fn, key, b, op, ...)				\
({									\
	int _r, l = (b)->level - 1;					\
	bool _w = l <= (op)->lock;					\
	struct btree *_child = bch_btree_node_get((b)->c, op, key, l,	\
						  _w, b);		\
	if (!IS_ERR(_child)) {						\
		_r = bch_btree_ ## fn(_child, op, ##__VA_ARGS__);	\
		rw_unlock(_w, _child);					\
	} else								\
		_r = PTR_ERR(_child);					\
	_r;								\
})

/**
 * bcache_btree_root - call a function on the root of the btree
 * @fn:		function to call, which will be passed the child node
 * @c:		cache set
 * @op:		pointer to struct btree_op
 */
#define bcache_btree_root(fn, c, op, ...)				\
({									\
	int _r = -EINTR;						\
	do {								\
		struct btree *_b = (c)->root;				\
		bool _w = insert_lock(op, _b);				\
		rw_lock(_w, _b, _b->level);				\
		if (_b == (c)->root &&					\
		    _w == insert_lock(op, _b)) {			\
			_r = bch_btree_ ## fn(_b, op, ##__VA_ARGS__);	\
		}							\
		rw_unlock(_w, _b);					\
		bch_cannibalize_unlock(c);				\
		if (_r == -EINTR)					\
			schedule();					\
	} while (_r == -EINTR);						\
									\
	finish_wait(&(c)->btree_cache_wait, &(op)->wait);		\
	_r;								\
})
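The retry protocol behind the root macro (raise op->lock, return -EINTR, get called again with a write lock) can be sketched without any real locking as below. All names are hypothetical; want_write_lock() mirrors the `level <= op->lock` test used by bcache_btree(), and the callback stands in for an insert that finds out it needs to split.

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-in for struct btree_op: only the lock level is modelled. */
struct op {
	short lock;	/* take write locks at levels <= lock */
};

static bool want_write_lock(const struct op *op, int level)
{
	return level <= op->lock;
}

typedef int (*root_fn)(struct op *op, int root_level, bool write_locked);

/* Re-invoke fn on the root for as long as it asks to be restarted. */
static int call_on_root(root_fn fn, struct op *op, int root_level,
			int *attempts)
{
	int r;

	do {
		bool w = want_write_lock(op, root_level);

		(*attempts)++;
		r = fn(op, root_level, w);
	} while (r == -EINTR);

	return r;
}

/* Hypothetical operation that needs a write lock all the way up. */
static int insert_needing_split(struct op *op, int root_level,
				bool write_locked)
{
	if (!write_locked) {
		op->lock = root_level;	/* ask for write locks next time */
		return -EINTR;
	}
	return 0;
}
```

Starting with op.lock = 0 on a tree whose root is at level 2, the first pass runs read-locked and returns -EINTR, and the second pass succeeds with the write lock, so the callback runs exactly twice.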

#define MAP_DONE	0
#define MAP_CONTINUE	1

#define MAP_ALL_NODES	0
#define MAP_LEAF_NODES	1

#define MAP_END_KEY	1

typedef int (btree_map_nodes_fn)(struct btree_op *b_op, struct btree *b);
int __bch_btree_map_nodes(struct btree_op *op, struct cache_set *c,
			  struct bkey *from, btree_map_nodes_fn *fn, int flags);

static inline int bch_btree_map_nodes(struct btree_op *op, struct cache_set *c,
				      struct bkey *from, btree_map_nodes_fn *fn)
{
	return __bch_btree_map_nodes(op, c, from, fn, MAP_ALL_NODES);
}

static inline int bch_btree_map_leaf_nodes(struct btree_op *op,
					   struct cache_set *c,
					   struct bkey *from,
					   btree_map_nodes_fn *fn)
{
	return __bch_btree_map_nodes(op, c, from, fn, MAP_LEAF_NODES);
}
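How a callback's MAP_CONTINUE or MAP_DONE return value and the MAP_ALL_NODES or MAP_LEAF_NODES flag might interact can be illustrated on a toy in-memory tree. This is a sketch under assumptions: struct tnode, map_nodes() and count_node() are hypothetical, there is no locking or key iteration, and the real traversal is implemented in btree.c.

```c
#include <stddef.h>

#define MAP_DONE	0
#define MAP_CONTINUE	1

#define MAP_ALL_NODES	0
#define MAP_LEAF_NODES	1

/* Toy tree node; level 0 means leaf. */
struct tnode {
	int level;
	size_t nr_children;
	struct tnode **children;
};

typedef int (map_fn)(void *ctx, struct tnode *n);

/*
 * Visit children first, stop as soon as a callback returns anything
 * other than MAP_CONTINUE, and call fn on interior nodes only when
 * all nodes were requested.
 */
static int map_nodes(struct tnode *n, map_fn *fn, void *ctx, int flags)
{
	int ret = MAP_CONTINUE;
	size_t i;

	for (i = 0; i < n->nr_children; i++) {
		ret = map_nodes(n->children[i], fn, ctx, flags);
		if (ret != MAP_CONTINUE)
			return ret;
	}

	if (n->level == 0 || flags == MAP_ALL_NODES)
		ret = fn(ctx, n);

	return ret;
}

/* Example callback: count every node it is handed. */
static int count_node(void *ctx, struct tnode *n)
{
	(void)n;
	++*(int *)ctx;
	return MAP_CONTINUE;
}
```

With a root that has two leaf children, MAP_LEAF_NODES hands the callback two nodes and MAP_ALL_NODES hands it three.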

typedef int (btree_map_keys_fn)(struct btree_op *op, struct btree *b,
				struct bkey *k);
int bch_btree_map_keys(struct btree_op *op, struct cache_set *c,
		       struct bkey *from, btree_map_keys_fn *fn, int flags);
int bch_btree_map_keys_recurse(struct btree *b, struct btree_op *op,
			       struct bkey *from, btree_map_keys_fn *fn,
			       int flags);

typedef bool (keybuf_pred_fn)(struct keybuf *buf, struct bkey *k);

void bch_keybuf_init(struct keybuf *buf);
void bch_refill_keybuf(struct cache_set *c, struct keybuf *buf,
		       struct bkey *end, keybuf_pred_fn *pred);
bool bch_keybuf_check_overlapping(struct keybuf *buf, struct bkey *start,
				  struct bkey *end);
void bch_keybuf_del(struct keybuf *buf, struct keybuf_key *w);
struct keybuf_key *bch_keybuf_next(struct keybuf *buf);
struct keybuf_key *bch_keybuf_next_rescan(struct cache_set *c,
					  struct keybuf *buf,
					  struct bkey *end,
					  keybuf_pred_fn *pred);
void bch_update_bucket_in_use(struct cache_set *c, struct gc_stat *stats);
#endif