Fixes below error with clang:
../drivers/md/bcache/sysfs.c:759:3: error: function definition is not allowed here
{ return *((uint16_t *) r) - *((uint16_t *) l); }
^
../drivers/md/bcache/sysfs.c:789:32: error: use of undeclared identifier 'cmp'
sort(p, n, sizeof(uint16_t), cmp, NULL);
^
2 errors generated.
v2:
rename function to __bch_cache_cmp
Signed-off-by: Peter Foley <pefoley2@pefoley.com>
Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 09b3efec ("bcache: Don't reinvent the wheel but use existing llist
API") replaces the following while loop by llist_for_each_entry(),
-
- while (reverse) {
- cl = container_of(reverse, struct closure, list);
- reverse = llist_next(reverse);
-
+ llist_for_each_entry(cl, reverse, list) {
closure_set_waiting(cl, 0);
closure_sub(cl, CLOSURE_WAITING + 1);
}
This modification introduces a potential race by iterating a corrupted
list. Here is how it happens.
In the above modification, closure_sub() may wake up a process which is
waiting on reverse list. If this process decides to wait again by calling
closure_wait(), its cl->list will be added to another wait list. Then
when llist_for_each_entry() continues to iterate next node, it will travel
on another new wait list which is added in closure_wait(), not the
original reverse list in __closure_wake_up(). It is more probably to
happen on UP machine because the waked up process may preempt the process
which wakes up it.
Use llist_for_each_entry_safe() will fix the issue, the safe version fetch
next node before waking up a process. Then the copy of next node will make
sure list iteration stays on original reverse list.
Fixes: 09b3efec81 ("bcache: Don't reinvent the wheel but use existing llist API")
Signed-off-by: Coly Li <colyli@suse.de>
Reported-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bcache uses a Proportion-Differentiation Controller algorithm to control
writeback rate to cached devices. In the PD controller algorithm, dirty
stripes of thin flash device should not be counted in, because flash only
volumes never write back dirty data.
Currently dirty stripe counter for thin flash device is not initialized
when the thin flash device starts. Which means the following calculation
in PD controller will reference an undefined dirty stripes number, and
all cached devices attached to the same cache set where the thin flash
device lies on may have an inaccurate writeback rate.
This patch calles bch_sectors_dirty_init() in flash_dev_run(), to
correctly initialize dirty stripe counter when the thin flash device
starts to run. This patch also does following parameter data type change,
-void bch_sectors_dirty_init(struct cached_dev *dc);
+void bch_sectors_dirty_init(struct bcache_device *);
to call this function conveniently in flash_dev_run().
(Commit log is composed by Coly Li)
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Most importantly, solve a crash where %llu was used to format signed
numbers. This would cause a buffer overflow when reading sysfs
writeback_rate_debug, as only 20 bytes were allocated for this and
%llu writes 20 characters plus a null.
Always use the units mechanism rather than having different output
paths for simplicity.
Also, correct problems with display output where 1.10 was a larger
number than 1.09, by multiplying by 10 and then dividing by 1024 instead
of dividing by 100. (Remainders of >= 1000 would print as .10).
Minor changes: Always display the decimal point instead of trying to
omit it based on number of digits shown. Decide what units to use
based on 1000 as a threshold, not 1024 (in other words, always print
at most 3 digits before the decimal point).
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reported-by: Dmitry Yu Okunev <dyokunev@ut.mephi.ru>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
continue_at() doesn't have a return statement anymore.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In olden times, closure_return() used to have a hidden return built in.
We removed the hidden return but forgot to add a new return here. If
"c" were NULL we would oops on the next line, but fortunately "c" is
never NULL. Let's just remove the if statement.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In currently, we only alloc 6 open buckets for each cache set,
but in usually, we always attach about 10 or so backend devices for
each cache set, and the each bcache device are always accessed by
about 10 or so threads in top application layer. So 6 open buckets
are too few, It has led to that each of the same thread write data
to different buckets, which would cause low efficiency write-back,
and also cause buckets inefficient, and would be Very easy to run
out of.
I add debug message in bch_open_buckets_alloc() to print alloc bucket
info, and test with ten bcache devices with a cache set, and each
bcache device is accessed by ten threads.
From the debug message, we can see that, after the modification, One
bucket is more likely to assign to the same thread, and the data from
the same thread are more likely to write the same bucket. Usually the
same thread always write/read the same backend device, so it is good
for write-back and also promote the usage efficiency of buckets.
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If you encounter any errors in bch_cached_dev_attach it will return
a negative error code. The variable 'v' which stores the result is
unsigned, thus user space sees a very large value returned for bytes
written which can cause incorrect user space behavior. Utilize 1
signed variable to use throughout the function to preserve error return
capability.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Acked-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__update_write_rate() uses a Proportion-Differentiation Controller
algorithm to control writeback rate. A dirty target number is used in
this PD controller to control writeback rate. A larger target number
will make the writeback rate smaller, on the versus, a smaller target
number will make the writeback rate larger.
bcache uses the following steps to calculate the target number,
1) cache_sectors = all-buckets-of-cache-set * buckets-size
2) cache_dirty_target = cache_sectors * cached-device-writeback_percent
3) target = cache_dirty_target *
(sectors-of-cached-device/sectors-of-all-cached-devices-of-this-cache-set)
The calculation at step 1) for cache_sectors is incorrect, which does
not consider dirty blocks occupied by flash only volume.
A flash only volume can be took as a bcache device without cached
device. All data sectors allocated for it are persistent on cache device
and marked dirty, they are not touched by bcache writeback and garbage
collection code. So data blocks of flash only volume should be ignore
when calculating cache_sectors of cache set.
Current code does not subtract dirty sectors of flash only volume, which
results a larger target number from the above 3 steps. And in sequence
the cache device's writeback rate is smaller then a correct value,
writeback speed is slower on all cached devices.
This patch fixes the incorrect slower writeback rate by subtracting
dirty sectors of flash only volumes in __update_writeback_rate().
(Commit log composed by Coly Li to pass checkpatch.pl checking)
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
I try to execute the following command to trigger gc thread:
[root@localhost internal]# echo 1 > trigger_gc
But it does not work, I debug the code in gc_should_run(), It works only
if in invalidating or sectors_to_gc < 0. So set sectors_to_gc to -1 to
meet the condition when we trigger gc by manual command.
(Code comments aded by Coly Li)
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Although llist provides proper APIs, they are not used. Make them used.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since bypassed IOs use no bucket, so do not subtract sectors_to_gc to
trigger gc thread.
Signed-off-by: tang.junhui <tang.junhui@zte.com.cn>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Sequential write IOs were tested with bs=1M by FIO in writeback cache
mode, these IOs were expected to be bypassed, but actually they did not.
We debug the code, and find in check_should_bypass():
if (!congested &&
mode == CACHE_MODE_WRITEBACK &&
op_is_write(bio_op(bio)) &&
(bio->bi_opf & REQ_SYNC))
goto rescale
that means, If in writeback mode, a write IO with REQ_SYNC flag will not
be bypassed though it is a sequential large IO, It's not a correct thing
to do actually, so this patch remove these codes.
Signed-off-by: tang.junhui <tang.junhui@zte.com.cn>
Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If blkdev_get_by_path() in register_bcache() fails, we try to lookup the
block device using lookup_bdev() to detect which situation we are in to
properly report error. However we never drop the reference returned to
us from lookup_bdev(). Fix that.
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This way we don't need a block_device structure to submit I/O. The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).
For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.
Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No functional change in this patch, just in preparation for
basing the inflight mechanism on the queue in question.
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- Add the SYSTEM_SCHEDULING bootup state to move various scheduler
debug checks earlier into the bootup. This turns silent and
sporadically deadly bugs into nice, deterministic splats. Fix some
of the splats that triggered. (Thomas Gleixner)
- A round of restructuring and refactoring of the load-balancing and
topology code (Peter Zijlstra)
- Another round of consolidating ~20 of incremental scheduler code
history: this time in terms of wait-queue nomenclature. (I didn't
get much feedback on these renaming patches, and we can still
easily change any names I might have misplaced, so if anyone hates
a new name, please holler and I'll fix it.) (Ingo Molnar)
- sched/numa improvements, fixes and updates (Rik van Riel)
- Another round of x86/tsc scheduler clock code improvements, in hope
of making it more robust (Peter Zijlstra)
- Improve NOHZ behavior (Frederic Weisbecker)
- Deadline scheduler improvements and fixes (Luca Abeni, Daniel
Bristot de Oliveira)
- Simplify and optimize the topology setup code (Lauro Ramos
Venancio)
- Debloat and decouple scheduler code some more (Nicolas Pitre)
- Simplify code by making better use of llist primitives (Byungchul
Park)
- ... plus other fixes and improvements"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
sched/cputime: Refactor the cputime_adjust() code
sched/debug: Expose the number of RT/DL tasks that can migrate
sched/numa: Hide numa_wake_affine() from UP build
sched/fair: Remove effective_load()
sched/numa: Implement NUMA node level wake_affine()
sched/fair: Simplify wake_affine() for the single socket case
sched/numa: Override part of migrate_degrades_locality() when idle balancing
sched/rt: Move RT related code from sched/core.c to sched/rt.c
sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
sched/fair: Spare idle load balancing on nohz_full CPUs
nohz: Move idle balancer registration to the idle path
sched/loadavg: Generalize "_idle" naming to "_nohz"
sched/core: Drop the unused try_get_task_struct() helper function
sched/fair: WARN() and refuse to set buddy when !se->on_rq
sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
...
Rename:
wait_queue_t => wait_queue_entry_t
'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.
Start sorting this out by renaming it to 'wait_queue_entry_t'.
This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This function allocates a bio, then a collection
of pages. It copes with failure.
It currently uses a mempool() to allocate the bio,
but alloc_page() to allocate the pages. These fail
in different ways, so the usage is inconsistent.
Change the bio_clone() to bio_clone_kmalloc()
so that no pool is used either for the bio or the pages.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by : Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch converts bioset_create() to not create a workqueue by
default, so alloctions will never trigger punt_bios_to_rescuer(). It
also introduces a new flag BIOSET_NEED_RESCUER which tells
bioset_create() to preserve the old behavior.
All callers of bioset_create() that are inside block device drivers,
are given the BIOSET_NEED_RESCUER flag.
biosets used by filesystems or other top-level users do not
need rescuing as the bio can never be queued behind other
bios. This includes fs_bio_set, blkdev_dio_pool,
btrfs_bioset, xfs_ioend_bioset, and one allocated by
target_core_iblock.c.
biosets used by md/raid do not need rescuing as
their usage was recently audited and revised to never
risk deadlock.
It is hoped that most, if not all, of the remaining biosets
can end up being the non-rescued version.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Credit-to: Ming Lei <ming.lei@redhat.com> (minor fixes)
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
"flags" arguments are often seen as good API design as they allow
easy extensibility.
bioset_create_nobvec() is implemented internally as a variation in
flags passed to __bioset_create().
To support future extension, make the internal structure part of the
API.
i.e. add a 'flags' argument to bioset_create() and discard
bioset_create_nobvec().
Note that the bio_split allocations in drivers/md/raid* do not need
the bvec mempool - they should have used bioset_create_nobvec().
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
bcache_device_init uses kmalloc for small requests and vmalloc for those
which are larger than 64 pages. This alone is a strange criterion.
Moreover kmalloc can fallback to vmalloc on the failure. Let's simply
use kvmalloc instead as it knows how to handle the fallback properly
Link: http://lkml.kernel.org/r/20170306103327.2766-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are many code paths opencoding kvmalloc. Let's use the helper
instead. The main difference to kvmalloc is that those users are
usually not considering all the aspects of the memory allocator. E.g.
allocation requests <= 32kB (with 4kB pages) are basically never failing
and invoke OOM killer to satisfy the allocation. This sounds too
disruptive for something that has a reasonable fallback - the vmalloc.
On the other hand those requests might fallback to vmalloc even when the
memory allocator would succeed after several more reclaim/compaction
attempts previously. There is no guarantee something like that happens
though.
This patch converts many of those places to kv[mz]alloc* helpers because
they are more conservative.
Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Santosh Raspatur <santosh@chelsio.com>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Yishai Hadas <yishaih@mellanox.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: "Yan, Zheng" <zyan@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't actually need the full rculist.h header in sched.h anymore,
we will be able to include the smaller rcupdate.h header instead.
But first update code that relied on the implicit header inclusion.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/task_stack.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/task_stack.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which
will have to be picked up from other headers and .c files.
Create a trivial placeholder <linux/sched/clock.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We will want to have struct backing_dev_info allocated separately from
struct request_queue. As the first step add pointer to backing_dev_info
to request_queue and convert all users touching it. No functional
changes in this patch.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
This centralizes the checks for bios that needs to be go into the flush
state machine.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Instead we use standard iterator way to do that.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Some drivers often use external bvec table, so introduce
this helper for this case. It is always safe to access the
bio->bi_io_vec in this way for this case.
After converting to this usage, it will becomes a bit easier
to evaluate the remaining direct access to bio->bi_io_vec,
so it can help to prepare for the following multipage bvec
support.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixed up the new O_DIRECT cases.
Signed-off-by: Jens Axboe <axboe@fb.com>
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly. Where applicable this also drops usage of the
bio_set_op_attrs wrapper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
(and remove one layer of masking for the op_is_write call next to it).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
bio_free_pages is introduced in commit 1dfa0f68c0
("block: add a helper to free bio bounce buffer pages"),
we can reuse the func in other modules after it was
imported.
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The original error was thought to be corruption, but was actually caused by:
make-bcache --data-offset N
where N was in bytes and should have been in sectors. While userspace
tools should be updated to check --data-offset beyond end of volume,
hopefully this will help others that might not have noticed the units.
Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
This patch fixes a cachedev registration-time allocation deadlock.
This can deadlock on boot if your initrd auto-registeres bcache devices:
Allocator thread:
[ 720.727614] INFO: task bcache_allocato:3833 blocked for more than 120 seconds.
[ 720.732361] [<ffffffff816eeac7>] schedule+0x37/0x90
[ 720.732963] [<ffffffffa05192b8>] bch_bucket_alloc+0x188/0x360 [bcache]
[ 720.733538] [<ffffffff810e6950>] ? prepare_to_wait_event+0xf0/0xf0
[ 720.734137] [<ffffffffa05302bd>] bch_prio_write+0x19d/0x340 [bcache]
[ 720.734715] [<ffffffffa05190bf>] bch_allocator_thread+0x3ff/0x470 [bcache]
[ 720.735311] [<ffffffff816ee41c>] ? __schedule+0x2dc/0x950
[ 720.735884] [<ffffffffa0518cc0>] ? invalidate_buckets+0x980/0x980 [bcache]
Registration thread:
[ 720.710403] INFO: task bash:3531 blocked for more than 120 seconds.
[ 720.715226] [<ffffffff816eeac7>] schedule+0x37/0x90
[ 720.715805] [<ffffffffa05235cd>] __bch_btree_map_nodes+0x12d/0x150 [bcache]
[ 720.716409] [<ffffffffa0522d30>] ? bch_btree_insert_check_key+0x1c0/0x1c0 [bcache]
[ 720.717008] [<ffffffffa05236e4>] bch_btree_insert+0xf4/0x170 [bcache]
[ 720.717586] [<ffffffff810e6950>] ? prepare_to_wait_event+0xf0/0xf0
[ 720.718191] [<ffffffffa0527d9a>] bch_journal_replay+0x14a/0x290 [bcache]
[ 720.718766] [<ffffffff810cc90d>] ? ttwu_do_activate.constprop.94+0x5d/0x70
[ 720.719369] [<ffffffff810cf684>] ? try_to_wake_up+0x1d4/0x350
[ 720.719968] [<ffffffffa05317d0>] run_cache_set+0x580/0x8e0 [bcache]
[ 720.720553] [<ffffffffa053302e>] register_bcache+0xe2e/0x13b0 [bcache]
[ 720.721153] [<ffffffff81354cef>] kobj_attr_store+0xf/0x20
[ 720.721730] [<ffffffff812a2dad>] sysfs_kf_write+0x3d/0x50
[ 720.722327] [<ffffffff812a225a>] kernfs_fop_write+0x12a/0x180
[ 720.722904] [<ffffffff81225177>] __vfs_write+0x37/0x110
[ 720.723503] [<ffffffff81228048>] ? __sb_start_write+0x58/0x110
[ 720.724100] [<ffffffff812cedb3>] ? security_file_permission+0x23/0xa0
[ 720.724675] [<ffffffff812258a9>] vfs_write+0xa9/0x1b0
[ 720.725275] [<ffffffff8102479c>] ? do_audit_syscall_entry+0x6c/0x70
[ 720.725849] [<ffffffff81226755>] SyS_write+0x55/0xd0
[ 720.726451] [<ffffffff8106a390>] ? do_page_fault+0x30/0x80
[ 720.727045] [<ffffffff816f2cae>] system_call_fastpath+0x12/0x71
The fifo code in upstream bcache can't use the last element in the buffer,
which was the cause of the bug: if you asked for a power of two size,
it'd give you a fifo that could hold one less than what you asked for
rather than allocating a buffer twice as big.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: stable@vger.kernel.org
register_cache() is supposed to return an error string on error so that
register_bcache() will will blkdev_put and cleanup other user counters,
but it does not set 'char *err' when cache_alloc() fails (eg, due to
memory pressure) and thus register_bcache() performs no cleanup.
register_bcache() <----------\ <- no jump to err_close, no blkdev_put()
| |
+->register_cache() | <- fails to set char *err
| |
+->cache_alloc() ---/ <- returns error
This patch sets `char *err` for this failure case so that register_cache()
will cause register_bcache() to correctly jump to err_close and do
cleanup. This was tested under OOM conditions that triggered the bug.
Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: stable@vger.kernel.org
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.
No intended functional changes in this commit.
Signed-off-by: Jens Axboe <axboe@fb.com>
Instead of a flag and an index just make sure an index of 0 means
no need to free the bvec array. Also move the constants related
to the bvec pools together and use a consistent naming scheme for
them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We have assigned sb->block_size before the switch,
so remove the redundant one.
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Acked-by: Eric Wheeler <bcache@lists.ewheeler.net>
Signed-off-by: Jens Axboe <axboe@fb.com>
There is no return in continue_at(), update the documentation.
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Cache_sb is not used in cache_alloc, and we have copied
sb info to cache->sb already, remove it.
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
alloc_workqueue replaces deprecated create_workqueue().
Dedicated workqueues have been used since bcache_wq and moving_gc_wq
are workqueues for writes and are being used on a memory reclaim path.
WQ_MEM_RECLAIM has been set to ensure forward progress under memory
pressure.
Since there are only a fixed number of work items, explicit concurrency
limit is unnecessary here.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have bcache
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We currently set REQ_WRITE/WRITE for all non READ IOs
like discard, flush, writesame, etc. In the next patches where we
no longer set up the op as a bitmap, we will not be able to
detect a operation direction like writesame by testing if REQ_WRITE is
set.
This has bcache use the op_is_write helper which will do the right
thing.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Fixed up fs/ext4/crypto.c
Signed-off-by: Jens Axboe <axboe@fb.com>
bch_gc_thread() doesn't mark itself freezable, so calling try_to_freeze()
in its context is just an expensive no-op.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
bch_allocator_thread() is calling try_to_freeze(), but that's just an
expensive no-op given the fact that the thread is not marked freezable.
Bucket allocator has to be up and running to the very last stages of the
suspend, as the bcache I/O that's in flight (think of writing an
hibernation image to a swap device served by bcache).
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
bch_writeback_thread() is calling try_to_freeze(), but that's just an
expensive no-op given the fact that the thread is not marked freezable.
I/O helper kthreads, exactly such as the bcache writeback thread, actually
shouldn't be freezable, because they are potentially necessary for
finalizing the image write-out.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
When bch_cache_set_alloc() fails to kzalloc the cache_set, the
asyncronous closure handling tries to dereference a cache_set that
hadn't yet been allocated inside of cache_set_flush() which is called
by __cache_set_unregister() during cleanup. This appears to happen only
during an OOM condition on bcache_register.
Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: stable@vger.kernel.org
The bch_writeback_thread might BUG_ON in read_dirty() if
dc->sb==BDEV_STATE_DIRTY and bch_sectors_dirty_init has not yet completed
its related initialization. This patch downs the dc->writeback_lock until
after initialization is complete, thus preventing bch_writeback_thread
from proceeding prematurely.
See this thread:
http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3453
Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
Tested-by: Marc MERLIN <marc@merlins.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block driver updates from Jens Axboe:
"This is the block driver pull request for 4.5, with the exception of
NVMe, which is in a separate branch and will be posted after this one.
This pull request contains:
- A set of bcache stability fixes, which have been acked by Kent.
These have been used and tested for more than a year by the
community, so it's about time that they got in.
- A set of drbd updates from the drbd team (Andreas, Lars, Philipp)
and Markus Elfring, Oleg Drokin.
- A set of fixes for xen blkback/front from the usual suspects, (Bob,
Konrad) as well as community based fixes from Kiri, Julien, and
Peng.
- A 2038 time fix for sx8 from Shraddha, with a fix from me.
- A small mtip32xx cleanup from Zhu Yanjun.
- A null_blk division fix from Arnd"
* 'for-4.5/drivers' of git://git.kernel.dk/linux-block: (71 commits)
null_blk: use sector_div instead of do_div
mtip32xx: restrict variables visible in current code module
xen/blkfront: Fix crash if backend doesn't follow the right states.
xen/blkback: Fix two memory leaks.
xen/blkback: make st_ statistics per ring
xen/blkfront: Handle non-indirect grant with 64KB pages
xen-blkfront: Introduce blkif_ring_get_request
xen-blkback: clear PF_NOFREEZE for xen_blkif_schedule()
xen/blkback: Free resources if connect_ring failed.
xen/blocks: Return -EXX instead of -1
xen/blkback: make pool of persistent grants and free pages per-queue
xen/blkback: get the number of hardware queues/rings from blkfront
xen/blkback: pseudo support for multi hardware queues/rings
xen/blkback: separate ring information out of struct xen_blkif
xen/blkfront: correct setting for xen_blkif_max_ring_order
xen/blkfront: make persistent grants pool per-queue
xen/blkfront: Remove duplicate setting of ->xbdev.
xen/blkfront: Cleanup of comments, fix unaligned variables, and syntax errors.
xen/blkfront: negotiate number of queues/rings to be used with backend
xen/blkfront: split per device io_lock
...
Previously, it would only scan the entire disk if it was starting from
the very start of the disk - i.e. if the previous scan got to the end.
This was broken by refill_full_stripes(), which updates last_scanned so
that refill_dirty was never triggering the searched_from_start path.
But if we change refill_dirty() to always scan the entire disk if
necessary, regardless of what last_scanned was, the code gets cleaner
and we fix that bug too.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Added a safeguard in the shutdown case. At least while not being
attached it is also possible to trigger a kernel bug by writing into
writeback_running. This change adds the same check before trying to
wake up the thread for that case.
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Allows to use register, not register_quiet in udev to avoid "device_busy" error.
The initial patch proposed at https://lkml.org/lkml/2013/8/26/549 by Gabriel de Perthuis
<g2p.code@gmail.com> does not unlock the mutex and hangs the kernel.
See http://thread.gmane.org/gmane.linux.kernel.bcache.devel/2594 for the discussion.
Cc: Denis Bychkov <manover@gmail.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Gabriel de Perthuis <g2p.code@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
In bcache_init() function it forgot to unregister reboot notifier if
bcache fails to unregister a block device. This commit fixes this.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Joshua Schmid <jschmid@suse.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Subject : [PATCH v2] bcache: fix a livelock in btree lock
Date : Wed, 25 Feb 2015 20:32:09 +0800 (02/25/2015 04:32:09 AM)
This commit tries to fix a livelock in bcache. This livelock might
happen when we causes a huge number of cache misses simultaneously.
When we get a cache miss, bcache will execute the following path.
->cached_dev_make_request()
->cached_dev_read()
->cached_lookup()
->bch->btree_map_keys()
->btree_root() <------------------------
->bch_btree_map_keys_recurse() |
->cache_lookup_fn() |
->cached_dev_cache_miss() |
->bch_btree_insert_check_key() -|
[If btree->seq is not equal to seq + 1, we should return
EINTR and traverse btree again.]
In bch_btree_insert_check_key() function we first need to check upgrade
flag (op->lock == -1), and when this flag is true we need to release
read btree->lock and try to take write btree->lock. During taking and
releasing this write lock, btree->seq will be monotone increased in
order to prevent other threads modify this in cache miss (see btree.h:74).
But if there are some cache misses caused by some requested, we could
meet a livelock because btree->seq is always changed by others. Thus no
one can make progress.
This commit will try to take write btree->lock if it encounters a race
when we traverse btree. Although it sacrifice the scalability but we
can ensure that only one can modify the btree.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Joshua Schmid <jschmid@suse.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Joshua Schmid <jschmid@suse.com>
Cc: Zhu Yanhai <zhu.yanhai@gmail.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block IO poll support from Jens Axboe:
"Various groups have been doing experimentation around IO polling for
(really) fast devices. The code has been reviewed and has been
sitting on the side for a few releases, but this is now good enough
for coordinated benchmarking and further experimentation.
Currently O_DIRECT sync read/write are supported. A framework is in
the works that allows scalable stats tracking so we can auto-tune
this. And we'll add libaio support as well soon. Fow now, it's an
opt-in feature for test purposes"
* 'for-4.4/io-poll' of git://git.kernel.dk/linux-block:
direct-io: be sure to assign dio->bio_bdev for both paths
directio: add block polling support
NVMe: add blk polling support
block: add block polling support
blk-mq: return tag/queue combo in the make_request_fn handlers
block: change ->make_request_fn() and users to return a queue cookie
Pull trivial updates from Jiri Kosina:
"Trivial stuff from trivial tree that can be trivially summed up as:
- treewide drop of spurious unlikely() before IS_ERR() from Viresh
Kumar
- cosmetic fixes (that don't really affect basic functionality of the
driver) for pktcdvd and bcache, from Julia Lawall and Petr Mladek
- various comment / printk fixes and updates all over the place"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
bcache: Really show state of work pending bit
hwmon: applesmc: fix comment typos
Kconfig: remove comment about scsi_wait_scan module
class_find_device: fix reference to argument "match"
debugfs: document that debugfs_remove*() accepts NULL and error values
net: Drop unlikely before IS_ERR(_OR_NULL)
mm: Drop unlikely before IS_ERR(_OR_NULL)
fs: Drop unlikely before IS_ERR(_OR_NULL)
drivers: net: Drop unlikely before IS_ERR(_OR_NULL)
drivers: misc: Drop unlikely before IS_ERR(_OR_NULL)
UBI: Update comments to reflect UBI_METAONLY flag
pktcdvd: drop null test before destroy functions
No functional changes in this patch, but it prepares us for returning
a more useful cookie related to the IO that was queued up.
Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
WORK_STRUCT_PENDING is a mask for testing the pending bit.
test_bit() expects the number of the bit and we need to
use WORK_STRUCT_PENDING_BIT there.
Also work_data_bits() is defined in workqueues.h now.
I have noticed this just by chance when looking how
WORK_STRUCT_PENDING_BIT is used. The change is compile
tested.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The bcache driver has always accepted arbitrarily large bios and split
them internally. Now that every driver must accept arbitrarily large
bios this code isn't nessecary anymore.
Cc: linux-bcache@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently we have two different ways to signal an I/O error on a BIO:
(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.
So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Some drivers use it now, others just set the limits field manually.
But in preparation for splitting this into a hard and soft limit,
ensure that they all call the proper function for setting the hw
limit for discards.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This is horribly confusing, it breaks the flow of the code without
it being apparent in the caller.
Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Kent's email address in MAINTAINERS seems to be invalid.
This was his last sign-off address, so use that if appropriate.
Fix the S: status entry while there.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use kvfree() instead of open-coding it.
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.
This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it. c files
which need access to more backing-dev details now include
backing-dev.h directly. This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.
v2: fs/fat build failure fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
non-chains") regressed all existing callers that followed this pattern:
1) saving a bio's original bi_end_io
2) wiring up an intermediate bi_end_io
3) restoring the original bi_end_io from intermediate bi_end_io
4) calling bio_endio() to execute the restored original bi_end_io
The regression was due to BIO_CHAIN only ever getting set if
bio_inc_remaining() is called. For the above pattern it isn't set until
step 3 above (step 2 would've needed to establish BIO_CHAIN). As such
the first bio_endio(), in step 2 above, never decremented __bi_remaining
before calling the intermediate bi_end_io -- leaving __bi_remaining with
the value 1 instead of 0. When bio_inc_remaining() occurred during step
3 it brought it to a value of 2. When the second bio_endio() was
called, in step 4 above, it should've called the original bi_end_io but
it didn't because there was an extra reference that wasn't dropped (due
to atomic operations being optimized away since BIO_CHAIN wasn't set
upfront).
Fix this issue by removing the __bi_remaining management complexity for
all callers that use the above pattern -- bio_chain() is the only
interface that _needs_ to be concerned with __bi_remaining. For the
above pattern callers just expect the bi_end_io they set to get called!
Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
that aren't associated with the bio_chain() interface.
Also, the bio_inc_remaining() interface has been moved local to bio.c.
Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Struct bio has a reference count that controls when it can be freed.
Most uses cases is allocating the bio, which then returns with a
single reference to it, doing IO, and then dropping that single
reference. We can remove this atomic_dec_and_test() in the completion
path, if nobody else is holding a reference to the bio.
If someone does call bio_get() on the bio, then we flag the bio as
now having valid count and that we must properly honor the reference
count when it's being put.
Tested-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Use generic io stats accounting help functions (generic_{start,end}_io_acct)
to simplify io stat accounting.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: Kent Overstreet <kmo@datera.io>
Signed-off-by: Jens Axboe <axboe@fb.com>
Clear QUEUE_FLAG_ADD_RANDOM in all block drivers that set
QUEUE_FLAG_NONROT.
Historically, all block devices have automatically made entropy
contributions. But as previously stated in commit e2e1a148 ("block: add
sysfs knob for turning off disk entropy contributions"):
- On SSD disks, the completion times aren't as random as they
are for rotational drives. So it's questionable whether they
should contribute to the random pool in the first place.
- Calling add_disk_randomness() has a lot of overhead.
There are more reliable sources for randomness than non-rotational block
devices. From a security perspective it is better to err on the side of
caution than to allow entropy contributions from unreliable "random"
sources.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
this is needed for the queue/block device we created (it's done by
blk_cleanup_queue() which we do call) - but calling it for the block devices we
only opened is pointless.
Change-Id: I53dfded14ed15b9581d10ca8399d5e1b3abbf9f2
Since bch_is_open will iterate linked list bch_cache_sets and
uncached_devices, it needs bch_register_lock.
Signed-off-by: Jianjian Huo <samuel.huo@gmail.com>
time_stats::btree_gc_max_duration_mc is not bit shifted by 8
Fixes BUG #138
Change-Id: I44fc6e1d0579674016acc533f1a546b080e5371a
Signed-off-by: Surbhi Palande <sap@daterainc.com>
If register_cache_set() failed, we would touch ca->set after
it had already been freed. Also, fix an assertion to catch
this.
Change-Id: I748e5f5b223e2d9b2602075dec2f997cced2394d
If we goto out_nocoalesce after we free new_nodes[0], we end up freeing
new_nodes[0] again. This was generating a lockdep warning. The fix is
to set new_nodes[0] to NULL, since the out_nocoalesce path safely
ignores NULL entries in the new_nodes array.
This regression was introduced in 2d7f9531.
Change-Id: I76564d7257800583214376b4bacf236cda90c89c
When running with multiple cache devices, if one of the devices has a completely
empty journal but we'd already found some journal entries on a previosu device
we'd go into an infinite loop.
Change-Id: I1dcdc0d738192746de28f40e8b08825b0dea5e2b
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
There's no point in blocking on these allocations, since our fallback paths will
probably go faster than blocking.
Change-Id: I733ca202c25cb36bde02607a0a60552229a4241c
this was very wrong - mempool_alloc() only guarantees success with GFP_WAIT.
bcache uses GFP_NOWAIT in various other places where we have a fallback,
circuits must've gotten crossed when writing this code or something.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
There were two issues here:
- writeback thread did not start until the device first became dirty
- writeback thread used uninterruptible sleep once running
Without this patch I see kernel warnings printed and a load average of
1.52 after booting my test VM. With this patch the warnings are gone and
the load average is near 0.00 as expected.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Tested:
- sometimes bcache_tier test would hang on startup with a failure
to allocate the btree root -- no longer seeing this
Signed-off-by: Kent Overstreet <kmo@daterainc.com>