Slowly working on pruning struct btree_op - the aim is for it to only
contain things that are actually necessary for traversing the btree.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Making things less asynchronous that don't need to be - bch_journal()
only has to block when the journal or journal entry is full, which is
emphatically not a fast path. So make it a normal function that just
returns when it finishes, to make the code and control flow easier to
follow.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Some refactoring - better to explicitly pass stuff around instead of
having it all in the "big bag of state", struct btree_op. Going to prune
struct btree_op quite a bit over time.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
This was the main point of all this refactoring - now,
btree_insert_check_key() won't fail just because the leaf node happened
to be full.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
We'll often end up with a list of adjacent keys to insert -
because bch_data_insert() may have to fragment the data it writes.
Originally, to simplify things and avoid having to deal with corner
cases bch_btree_insert() would pass keys from this list one at a time to
btree_insert_recurse() - mainly because the list of keys might span leaf
nodes, so it was easier this way.
With the btree_insert_node() refactoring, it's now a lot easier to just
pass down the whole list and have btree_insert_recurse() iterate over
leaf nodes until it's done.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
The flow of control in the old btree insertion code was rather -
backwards; we'd recurse down the btree (in btree_insert_recurse()), and
then if we needed to split the keys to be inserted into the parent node
would be effectively returned up to btree_insert_recurse(), which would
notice there was more work to do and finish the insertion.
The main problem with this was that the full logic for btree insertion
could only be used by calling btree_insert_recurse; if you'd gotten to a
btree leaf some other way and had a key to insert, if it turned out that
node needed to be split you were SOL.
This inverts the flow of control so btree_insert_node() does _full_
btree insertion, including splitting - and takes a (leaf) btree node to
insert into as a parameter.
This means we can now _correctly_ handle cache misses - for cache
misses, we need to insert a fake "check" key into the btree when we
discover we have a cache miss - while we still have the btree locked.
Previously, if the btree node was full inserting a cache miss would just
fail.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
This is prep work for the reworked btree insertion code.
The way we set b->parent is ugly and hacky... the problem is, when
btree_split() or garbage collection splits or rewrites a btree node, the
parent changes for all its (potentially already cached) children.
I may change this later and add some code to look through the btree node
cache and find all our cached child nodes and change the parent pointer
then...
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Dirty data accounting wasn't quite right - firstly, we were adding the key we're
inserting after it could have merged with another dirty key already in the
btree, and secondly we could sometimes pass the wrong offset to
bcache_dev_sectors_dirty_add() for dirty data we were overwriting - which is
important when tracking dirty data by stripe.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
GFP_NOIO means we could be getting called recursively - mca_alloc() ->
mca_data_alloc() - definitely can't use mutex_lock(bucket_lock) then.
Whoops.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix
drivers/md/bcache/btree.c: In function ‘bch_btree_node_read’:
drivers/md/bcache/btree.c:259: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 3 has type ‘size_t’
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert the driver shrinkers to the new API. Most changes are compile
tested only because I either don't have the hardware or it's staging
stuff.
FWIW, the md and android code is pretty good, but the rest of it makes me
want to claw my eyes out. The amount of broken code I just encountered is
mind boggling. I've added comments explaining what is broken, but I fear
that some of the code would be best dealt with by being dragged behind the
bike shed, burying in mud up to it's neck and then run over repeatedly
with a blunt lawn mower.
Special mention goes to the zcache/zcache2 drivers. They can't co-exist
in the build at the same time, they are under different menu options in
menuconfig, they only show up when you've got the right set of mm
subsystem options configured and so even compile testing is an exercise in
pulling teeth. And that doesn't even take into account the horrible,
broken code...
[glommer@openvz.org: fixes for i915, android lowmem, zcache, bcache]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Part of the job of garbage collection is to add up however many sectors
of live data it finds in each bucket, but that doesn't work very well if
it doesn't reset GC_SECTORS_USED() when it starts. Whoops.
This wouldn't have broken anything horribly, but allocation tries to
preferentially reclaim buckets that are mostly empty and that's not
gonna work with an incorrect GC_SECTORS_USED() value.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Some of bcache's utility code has made it into the rest of the kernel,
so drop the bcache versions.
Bcache used to have a workaround for allocating from a bio set under
generic_make_request() (if you allocated more than once, the bios you
already allocated would get stuck on current->bio_list when you
submitted, and you'd risk deadlock) - bcache would mask out __GFP_WAIT
when allocating bios under generic_make_request() so that allocation
could fail and it could retry from workqueue. But bio_alloc_bioset() has
a workaround now, so we can drop this hack and the associated error
handling.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Journal writes need to be marked FUA, not just REQ_FLUSH. And btree node
writes have... weird ordering requirements.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Now that we're tracking dirty data per stripe, we can add two
optimizations for raid5/6:
* If a stripe is already dirty, force writes to that stripe to
writeback mode - to help build up full stripes of dirty data
* When flushing dirty data, preferentially write out full stripes first
if there are any.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
To make background writeback aware of raid5/6 stripes, we first need to
track the amount of dirty data within each stripe - we do this by
breaking up the existing sectors_dirty into per stripe atomic_ts
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Previously, dirty_data wouldn't get initialized until the first garbage
collection... which was a bit of a problem for background writeback (as
the PD controller keys off of it) and also confusing for users.
This is also prep work for making background writeback aware of raid5/6
stripes.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Old gcc doesnt like the struct hack, and it is kind of ugly. So finish
off the work to convert pr_debug() statements to tracepoints, and delete
pkey()/pbtree().
Signed-off-by: Kent Overstreet <koverstreet@google.com>
The tracepoints were reworked to be more sensible, and fixed a null
pointer deref in one of the tracepoints.
Converted some of the pr_debug()s to tracepoints - this is partly a
performance optimization; it used to be that with DEBUG or
CONFIG_DYNAMIC_DEBUG pr_debug() was an empty macro; but at some point it
was changed to an empty inline function.
Some of the pr_debug() statements had rather expensive function calls as
part of the arguments, so this code was getting run unnecessarily even
on non debug kernels - in some fast paths, too.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
The most significant change is that btree reads are now done
synchronously, instead of asynchronously and doing the post read stuff
from a workqueue.
This was originally done because we can't block on IO under
generic_make_request(). But - we already have a mechanism to punt cache
lookups to workqueue if needed, so if we just use that we don't have to
deal with the complexity of doing things asynchronously.
The main benefit is this makes the locking situation saner; we can hold
our write lock on the btree node until we're finished reading it, and we
don't need that btree_node_read_done() flag anymore.
Also, for writes, btree_write() was broken out into btree_node_write()
and btree_leaf_dirty() - the old code with the boolean argument was dumb
and confusing.
The prio_blocked mechanism was improved a bit too, now the only counter
is in struct btree_write, we don't mess with transfering a count from
struct btree anymore.
This required changing garbage collection to block prios at the start
and unblock when it finishes, which is cleaner than what it was doing
anyways (the old code had mostly the same effect, but was doing it in a
convoluted way)
And the btree iter btree_node_read_done() uses was converted to a real
mempool.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
The main fix is that bch_allocator_thread() wasn't waiting on
garbage collection to finish (if invalidate_buckets had set
ca->invalidate_needs_gc); we need that to make sure the allocator
doesn't spin and potentially block gc from finishing.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
m68k/allmodconfig:
drivers/md/bcache/bset.c: In function ‘bset_search_tree’:
drivers/md/bcache/bset.c:727: error: implicit declaration of function ‘prefetch’
drivers/md/bcache/btree.c: In function ‘bch_btree_node_get’:
drivers/md/bcache/btree.c:933: error: implicit declaration of function ‘prefetch’
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Took out some nested functions, and fixed some more checkpatch
complaints.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: linux-bcache@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
config: make ARCH=i386 allmodconfig
All error/warnings:
drivers/md/bcache/bset.c: In function 'bch_ptr_bad':
>> drivers/md/bcache/bset.c:164:2: warning: format '%li' expects argument of type 'long int', but argument 4 has type 'size_t' [-Wformat]
--
drivers/md/bcache/debug.c: In function 'bch_pbtree':
>> drivers/md/bcache/debug.c:86:4: warning: format '%li' expects argument of type 'long int', but argument 4 has type 'size_t' [-Wformat]
--
drivers/md/bcache/btree.c: In function 'bch_btree_read_done':
>> drivers/md/bcache/btree.c:245:8: warning: format '%lu' expects argument of type 'long unsigned int', but argument 4 has type 'size_t' [-Wformat]
--
drivers/md/bcache/closure.o: In function `closure_debug_init':
>> (.init.text+0x0): multiple definition of `init_module'
>> drivers/md/bcache/super.o:super.c:(.init.text+0x0): first defined here
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: linux-bcache@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Does writethrough and writeback caching, handles unclean shutdown, and
has a bunch of other nifty features motivated by real world usage.
See the wiki at http://bcache.evilpiepirate.org for more.
Signed-off-by: Kent Overstreet <koverstreet@google.com>