Commit Graph

633371 Commits

Author SHA1 Message Date
Javier González 402ab9a89d lightnvm: add ECC error codes
Add ECC error codes to enable the appropriate handling in the target.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-29 12:12:51 -07:00
Javier González a24ba4644b lightnvm: export set bad block table
Bad blocks should be managed by block owners. These are either
targets, for data blocks, or sysblk, for system blocks.

In order to support this, export two functions: one to mark a block as
a specific type (e.g., bad block) and another to update the bad block
table on the device.

Move bad block management to rrpc.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-29 12:12:51 -07:00
Javier González 8a3c95ab38 lightnvm: do not protect block 0
Device blocks should be marked by the device and considered as bad
blocks by the media manager. Thus, do not make assumptions about which
blocks are going to be used by the device. In doing so we might lose
valid blocks from the free list.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-29 12:12:51 -07:00
Javier González bb3149792e lightnvm: enable to send hint to erase command
Erases might be subject to host hints. An example is multi-plane
programming to erase blocks in parallel. Enable targets to specify this
hint.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-29 12:12:51 -07:00
Matias Bjørling 3dc87dd048 nvme: lightnvm: attach lightnvm sysfs to nvme block device
Previously, LBA read and write were not supported in the lightnvm
specification. Now that it supports it, let's use the traditional
NVMe gendisk, and attach the lightnvm sysfs geometry export.

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-29 12:12:51 -07:00
Matias Bjørling 7498e99fc5 nvme: lightnvm: frees wrong cmd structure
When struct nvme_request was introduced, nvme_nvm_submit_io was
converted to the new interface, which moves the nvme_nvm_command
data structure into the struct request pdu. On I/O completion, however,
rq->cmd was freed instead of the command in the pdu, nvme_request->cmd.

Fixes: d49187e97e ("nvme: introduce struct nvme_request")
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-29 12:12:51 -07:00
Gabriel Krisman Bertazi 415d3dab96 blk-mq: Drop explicit timeout sync in hotplug
After commit 287922eb0b ("block: defer timeouts to a workqueue"),
deleting the timeout work after freezing the queue shouldn't be
necessary, since the synchronization is already enforced by the
acquisition of a q_usage_counter reference in blk_mq_timeout_work.

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Reviewed-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-29 08:01:08 -07:00
Jens Axboe d62118b6dd blk-wbt: allow wbt to be enabled always through sysfs
Currently there's no way to enable wbt if it's not enabled in the
kernel config by default for a device. Allow a write to the
'wbt_lat_usec' queue sysfs file to enable wbt.

This is useful both for the kernel config case and for devices
managed by CFQ, where wbt is turned off by default.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-28 10:27:03 -07:00
Jens Axboe fa224eed2b blk-wbt: cleanup disable-by-default for CFQ
Make it clear that we are disabling wbt for the specified queue,
if it was enabled by default. This is in preparation for allowing
users to re-enable wbt, and not have it disabled automatically
again.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-28 10:27:03 -07:00
Jens Axboe 80e091d10e blk-wbt: allow reset of default latency through sysfs
Allow a write of '-1' to reset the default latency target for
a given device. This removes knowledge of the different default
settings for rotational vs non-rotational from user space.
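As a rough illustration of how one sysfs value can either set an explicit target or restore the per-device default, here is a minimal user-space sketch. The helper name wbt_parse_lat_usec() and the two default latencies are assumptions for illustration only, not the kernel's actual store handler or constants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed defaults for illustration; the real values live in block/blk-wbt.c. */
#define HYPO_DEFAULT_LAT_ROT_NSEC    75000000ULL /* assumed: rotational */
#define HYPO_DEFAULT_LAT_NONROT_NSEC  2000000ULL /* assumed: non-rotational */

/*
 * Hypothetical helper: translate a value written to 'wbt_lat_usec' into a
 * latency target in nanoseconds. Writing -1 picks the device default, so
 * user space needs no knowledge of the rotational/non-rotational split.
 * Returns false for values this sketch rejects.
 */
static bool wbt_parse_lat_usec(long long val, bool rotational, uint64_t *lat_nsec)
{
	if (val == -1) {
		*lat_nsec = rotational ? HYPO_DEFAULT_LAT_ROT_NSEC
				       : HYPO_DEFAULT_LAT_NONROT_NSEC;
		return true;
	}
	if (val < 0)
		return false;
	*lat_nsec = (uint64_t)val * 1000ULL; /* usec -> nsec */
	return true;
}
```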

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-28 10:27:03 -07:00
Jens Axboe feffa5cc7b nbd: fix setting of 'error' in NBD_DO_IT ioctl
Several paths don't set it properly; ensure that they all do.

Fixes: 9561a7ade0 ("nbd: add multi-connection support")
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 19:09:45 -07:00
Jens Axboe 63db89eaf0 nbd: move multi-connection bit to unused value
Bit #7 is already used, move to bit #8 which is the first unused
one.

Fixes: 9561a7ade0 ("nbd: add multi-connection support")
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 13:11:55 -07:00
Josef Bacik 9561a7ade0 nbd: add multi-connection support
NBD can become contended on its single connection.  We have to serialize all
writes and we can only process one read response at a time.  Fix this by
allowing userspace to provide multiple connections to a single nbd device.  This,
coupled with blk-mq, drastically increases performance in multi-process cases.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 12:16:32 -07:00
Tejun Heo e00f4f4d0f block,blkcg: use __GFP_NOWARN for best-effort allocations in blkcg
blkcg allocates some per-cgroup data structures with GFP_NOWAIT and
when that fails falls back to operations which aren't specific to the
cgroup.  Occasional failures are expected under pressure and falling
back to non-cgroup operation is the right thing to do.

Unfortunately, I forgot to add __GFP_NOWARN to these allocations and
these expected failures end up creating a lot of noise.  Add
__GFP_NOWARN.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Marc MERLIN <marc@merlins.org>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:59:49 -07:00
Ming Lei 05aea81b4b fs: logfs: remove unnecesary check
The check on bio->bi_vcnt doesn't make sense in erase_end_io().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:55 -07:00
Ming Lei c124843678 fs: logfs: use bio_add_page() in do_erase()
This also simplifies the code a bit.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:55 -07:00
Ming Lei d4f98a89f9 fs: logfs: use bio_add_page() in __bdev_writeseg()
This patch also simplifies the code a bit.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:55 -07:00
Ming Lei 739a997546 fs: logfs: convert to bio_add_page() in sync_request()
bio_add_page() is always the standard and preferred way to
do this.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:55 -07:00
Ming Lei 4113b88a65 bcache: debug: avoid accessing .bi_io_vec directly
Use the standard bvec iterator instead.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:55 -07:00
Ming Lei 84c8590646 target: avoid accessing .bi_vcnt directly
When the bio is full, bio_add_pc_page() returns zero, so use
this return value to tell when the bio is full.

Also replace access to .bi_vcnt for pr_debug() with bio_segments().
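A minimal sketch of the "add until the add fails" pattern described above, using a hypothetical bio with a fixed-size vector table. hypo_add_page() only stands in for bio_add_pc_page() (returning 0 when full); the capacity and all names are assumptions, and the caller never inspects the vector count directly.

```c
#include <assert.h>

#define HYPO_MAX_VECS 4 /* hypothetical bvec table capacity */

struct hypo_bio { unsigned int vcnt; unsigned int len; };

/* Hypothetical stand-in for bio_add_pc_page(): returns the number of
 * bytes added, or 0 when the bio is full. */
static unsigned int hypo_add_page(struct hypo_bio *bio, unsigned int len)
{
	if (bio->vcnt == HYPO_MAX_VECS)
		return 0;
	bio->vcnt++;
	bio->len += len;
	return len;
}

/* Map nr_pages pages, starting a new bio whenever an add fails,
 * and return how many bios were used. */
static unsigned int hypo_map_pages(unsigned int nr_pages, unsigned int page_size)
{
	struct hypo_bio bio = { 0, 0 };
	unsigned int nr_bios = 1;

	for (unsigned int i = 0; i < nr_pages; i++) {
		if (hypo_add_page(&bio, page_size) == 0) {
			/* Full: a real caller would submit 'bio' here. */
			bio = (struct hypo_bio){ 0, 0 };
			nr_bios++;
			hypo_add_page(&bio, page_size); /* fresh bio has room */
		}
	}
	return nr_bios;
}
```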

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:55 -07:00
Ming Lei 2c73a603cd block: floppy: use bio_add_page()
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:55 -07:00
Ming Lei 06efffda51 block: drbd: remove impossible failure handling
For a non-cloned bio, bio_add_page() only returns failure when
the io vec table is full, but in that case, bio->bi_vcnt can't
be zero at all.

So remove the impossible failure handling.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:55 -07:00
Ming Lei 3a83f46775 block: bio: pass bvec table to bio_init()
Some drivers use an external bvec table, so let bio_init() take the
table directly for this case. Accessing bio->bi_io_vec this way is
always safe.

After converting to this usage, it becomes a bit easier to evaluate
the remaining direct accesses to bio->bi_io_vec, which helps prepare
for the upcoming multipage bvec support.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

Fixed up the new O_DIRECT cases.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:21 -07:00
Jens Axboe 9a794fb9bd block_dev: get rid of blksize bits calculation
We store the bdev sector size locally as a bit shift, but we no longer
use the shift for anything except shifting it back up to the bdev
sector size. So let's just use the sector size directly and kill the
variable and the bits calculation.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:56:25 -07:00
Damien Le Moal 4d1a476542 block_dev: Fixed direct I/O bio sector calculation
Direct I/O alignment must always be checked against the device block size,
but the I/O offset (bio->bi_iter.bi_sector) must always be expressed in 512B
sector units, not in units of the actual logical block size.
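A minimal sketch of the distinction, assuming a hypothetical helper dio_setup(): the alignment check uses the logical block size, while the starting sector is always the byte offset shifted right by 9 (512-byte units), whatever the logical block size is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT 9 /* block layer sectors are always 512 bytes */

/*
 * Hypothetical helper: validate a direct I/O request and compute its
 * starting sector. 'lbs' is the logical block size in bytes and must be
 * a power of two (e.g. 512 or 4096).
 */
static bool dio_setup(uint64_t pos, uint64_t len, unsigned int lbs,
		      uint64_t *sector)
{
	/* Alignment is checked against the logical block size... */
	if ((pos | len) & (lbs - 1))
		return false;
	/* ...but the sector is in 512-byte units, never pos / lbs. */
	*sector = pos >> SECTOR_SHIFT;
	return true;
}
```

For a 4096-byte logical block size, an offset of 8192 bytes maps to sector 16, not 2.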

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:09:09 -07:00
Shaun Tancheff 778889d841 block: apply blk_partition_remap to REQ_OP_ZONE_RESET
If a ZBC device is partitioned and operations are performed on the partition,
the zone information is rebased to the partition; however, the zone reset
is not mapped from the partition to the device as other operations are.

This leaves the API (report zones / reset zone) unbalanced in this
regard. Checking for the zone reset op code explicitly balances the
API.
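A simplified sketch of the remap decision, using hypothetical types and op names rather than the kernel's struct bio and REQ_OP_* definitions. The point it illustrates: a zone reset carries no data but still names a sector, so a remap condition keyed only on the data size would skip it unless the op is checked explicitly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical op codes standing in for REQ_OP_* values. */
enum hypo_op { HYPO_OP_READ, HYPO_OP_WRITE, HYPO_OP_ZONE_RESET, HYPO_OP_FLUSH };

struct hypo_bio {
	enum hypo_op op;
	uint64_t sector;   /* partition-relative on entry */
	unsigned int size; /* bytes of data; 0 for flush and zone reset */
};

/* Remap a partition-relative bio to a whole-device sector. */
static void hypo_partition_remap(struct hypo_bio *bio, uint64_t part_start)
{
	/* Data-carrying ops are shifted; the zone reset has size 0 but still
	 * addresses a zone, so it is checked explicitly. An empty flush has
	 * no meaningful sector and is left alone. */
	if (bio->size || bio->op == HYPO_OP_ZONE_RESET)
		bio->sector += part_start;
}
```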

Signed-off-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-21 15:08:24 -07:00
Christoph Hellwig 93c5bdf7ab block: clear all of bi_opf in bio_set_op_attrs
Since commit 87374179 ("block: add a proper block layer data direction
encoding") we only OR the new op and flags into bi_opf in bio_set_op_attrs
instead of clearing the old value.  I've not seen any breakage from the
new behavior, but it seems dangerous.

Also convert it to an inline function to make the argument passing
safer.
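To see why OR-ing alone is risky, consider a simplified encoding with the op in the low bits and flags above it. The layout (HYPO_OP_BITS, HYPO_OP_MASK) and op values below are assumptions for illustration, not the kernel's actual bi_opf layout. With these values, OR-ing a write (1) into a field that held a discard (3) leaves 3, so the stale op survives.

```c
#include <assert.h>
#include <stdint.h>

#define HYPO_OP_BITS 8
#define HYPO_OP_MASK ((1u << HYPO_OP_BITS) - 1)
#define HYPO_FLAG_SYNC (1u << HYPO_OP_BITS) /* hypothetical flag bit */

enum { HYPO_OP_READ = 0, HYPO_OP_WRITE = 1, HYPO_OP_DISCARD = 3 };

/* Risky variant: ORs into the old value, so stale op and flag bits survive. */
static inline void hypo_set_op_attrs_or(uint32_t *opf, uint32_t op, uint32_t flags)
{
	*opf |= op | flags;
}

/* Safer variant: overwrite the whole field, as the commit describes. */
static inline void hypo_set_op_attrs(uint32_t *opf, uint32_t op, uint32_t flags)
{
	*opf = op | flags;
}
```

Using an inline function rather than a macro also means the arguments are type-checked and evaluated exactly once.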

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-21 09:35:05 -07:00
Jens Axboe 5a8b187c61 pktcdvd: mark as unmaintained and deprecated
This driver is both orphaned, and not really useful anymore. Mark
it as such, and remove it in a future kernel after a release or
two.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-21 09:33:17 -07:00
Tobias Klauser 9a05e7541c block: Change extern inline to static inline
With compilers which follow the C99 standard (like modern versions of
gcc and clang), "extern inline" does the opposite thing from older
versions of gcc (emits code for an externally linkable version of the
inline function).

"static inline" does the intended behavior in all cases instead.

Description taken from commit 6d91857d48 ("staging, rtl8192e,
LLVMLinux: Change extern inline to static inline").

This also fixes the following GCC warning when building with CONFIG_PM
disabled:

  ./include/linux/blkdev.h:1143:20: warning: no previous prototype for 'blk_set_runtime_active' [-Wmissing-prototypes]
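A minimal header-style sketch of the pattern, assuming a hypothetical CONFIG_HYPO_PM switch and function name. With static inline, each translation unit that includes the header gets its own internal-linkage definition, so no externally visible symbol is emitted and -Wmissing-prototypes, which only concerns non-static functions, has nothing to report.

```c
#include <assert.h>

struct hypo_queue { int rpm_status; };

#ifdef CONFIG_HYPO_PM
/* Real implementation would live in a .c file. */
void hypo_set_runtime_active(struct hypo_queue *q);
#else
/* Stub for the !PM case: static inline behaves the same under C99 and
 * older GNU inline semantics. */
static inline void hypo_set_runtime_active(struct hypo_queue *q)
{
	(void)q;
}
#endif
```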

Fixes: d07ab6d114 ("block: Add blk_set_runtime_active()")
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-18 07:44:23 -07:00
Geliang Tang 55f958cc6c skd_main: drop duplicate header scatterlist.h
Drop duplicate header scatterlist.h from skd_main.c.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-18 07:44:21 -07:00
Jens Axboe 10e6246e22 block: document the 'io_poll_delay' queue sysfs file
This was documented in the original commit, 64f1c21e86, but it
never made it into the proper location for queue sysfs files.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-17 22:23:02 -07:00
Christoph Hellwig 542ff7bf18 block: new direct I/O implementation
Similar to the simple fast path, but we now need a dio structure to
track multiple-bio completions.  It's basically a cut-down version
of the new iomap-based direct I/O code for filesystems, but without
all the logic to call into the filesystem for extent lookup or
allocation, and without the complex I/O completion workqueue handler
for AIO - instead we just use the FUA bit on the bios to ensure
data is flushed to stable storage.
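A stripped-down user-space model of tracking multiple-bio completions with one structure. This is a sketch of the general reference-counting pattern, not the commit's code: C11 atomics stand in for kernel atomics, and every name is hypothetical. The submitter holds one reference, each in-flight bio holds another, and whichever drop reaches zero completes the request.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-request tracking structure. */
struct hypo_dio {
	atomic_int ref;   /* submitter + outstanding bios */
	atomic_int error; /* first error reported, 0 if none */
	bool done;
};

static void hypo_dio_init(struct hypo_dio *dio)
{
	atomic_init(&dio->ref, 1); /* the submitter's reference */
	atomic_init(&dio->error, 0);
	dio->done = false;
}

/* Take a reference before submitting each bio. */
static void hypo_dio_get(struct hypo_dio *dio)
{
	atomic_fetch_add(&dio->ref, 1);
}

/* Drop a reference from a bio completion, or once from the submitter. */
static void hypo_dio_put(struct hypo_dio *dio, int error)
{
	int expected = 0;

	/* Record only the first error reported by any bio. */
	if (error)
		atomic_compare_exchange_strong(&dio->error, &expected, error);
	/* fetch_sub returns the old value: 1 means this was the last ref. */
	if (atomic_fetch_sub(&dio->ref, 1) == 1)
		dio->done = true; /* a real implementation would wake the waiter */
}
```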

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-17 13:35:11 -07:00
Jens Axboe 78250c02d9 block: make __blkdev_direct_IO_sync() support O_SYNC/DSYNC
Split the op setting code into a helper, use it in both places.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-17 13:35:05 -07:00
Jens Axboe 72ecad22d9 block: support a full bio worth of IO for simplified bdev direct-io
Just alloc the bio_vec array if we exceed the inline limit.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-17 13:35:02 -07:00
Jens Axboe 64f1c21e86 blk-mq: make the polling code adaptive
The previous commit introduced the hybrid sleep/poll mode. Take
that one step further, and use the completion latencies to
automatically sleep for half the mean completion time. This is
a good approximation.

This changes the 'io_poll_delay' sysfs file a bit to expose the
various options. Depending on the value, the polling code will
behave differently:

-1	Never enter hybrid sleep mode
 0	Use half of the completion mean for the sleep delay
>0	Use this specific value as the sleep delay
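The three settings can be modelled by a small helper. This is a sketch under assumptions: hypo_poll_sleep_nsec() is a hypothetical name, mean_nsec stands in for the tracked mean completion time, and the configured value is taken as already converted to nanoseconds.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the io_poll_delay values listed above.
 * Returns how long to sleep before polling; 0 means poll immediately.
 */
static uint64_t hypo_poll_sleep_nsec(int64_t poll_delay, uint64_t mean_nsec)
{
	if (poll_delay < 0)
		return 0;                    /* -1: never enter hybrid sleep */
	if (poll_delay == 0)
		return mean_nsec / 2;        /* 0: half the completion mean */
	return (uint64_t)poll_delay;         /* >0: fixed sleep delay */
}
```

With a mean completion time of 8 usecs, the adaptive setting sleeps 4 usecs, matching the example in the hybrid poll commit below.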

Signed-off-by: Jens Axboe <axboe@fb.com>
Tested-By: Stephen Bates <sbates@raithlin.com>
Reviewed-By: Stephen Bates <sbates@raithlin.com>
2016-11-17 13:34:57 -07:00
Jens Axboe 06426adf07 blk-mq: implement hybrid poll mode for sync O_DIRECT
This patch enables a hybrid polling mode. Instead of polling after IO
submission, we can induce an artificial delay, and then poll after that.
For example, if the IO is presumed to complete in 8 usecs from now, we
can sleep for 4 usecs, wake up, and then do our polling. This still puts
a sleep/wakeup cycle in the IO path, but instead of the wakeup happening
after the IO has completed, it'll happen before. With this hybrid
scheme, we can achieve big latency reductions while still using the same
(or less) amount of CPU.

Signed-off-by: Jens Axboe <axboe@fb.com>
Tested-By: Stephen Bates <sbates@raithlin.com>
Reviewed-By: Stephen Bates <sbates@raithlin.com>
2016-11-17 13:34:51 -07:00
Christoph Hellwig 189ce2b9dc block: fast-path for small and simple direct I/O requests
This patch adds a small and simple fast path for small direct I/O
requests on block devices that don't use AIO.  Between the neat
bio_iov_iter_get_pages helper that avoids allocating a page array
for get_user_pages and the on-stack bio and biovec, this avoids memory
allocations and atomic operations entirely in the direct I/O code
(lower levels might still do memory allocations and will usually
have at least some atomic operations, though).

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tested-By: Stephen Bates <sbates@raithlin.com>
Reviewed-By: Stephen Bates <sbates@raithlin.com>
2016-11-17 13:34:45 -07:00
Jens Axboe 429a787be6 nbd: fix use-after-free of rq/bio in the xmit path
For writes, we can get a completion in while we're still iterating
the request and bio chain. If that happens, we're reading freed
memory and we can crash.

Break out after the last segment and avoid having the iterator
read freed memory.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-17 12:30:37 -07:00
Arnd Bergmann 4121d385f1 blk-wbt: fix old-style function declaration
The newly added driver causes a harmless warning in some configurations:

block/blk-wbt.c:250:1: error: ‘inline’ is not at beginning of declaration [-Werror=old-style-declaration]
 static bool inline stat_sample_valid(struct blk_rq_stat *stat)

This makes it use the expected format for the declaration.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-16 08:32:40 -07:00
Yasuaki Ishimatsu 92153d30c7 null_blk: add usage hints for NVM
If CONFIG_NVM is disabled, loading the null_blk module with use_lightnvm=1
fails, but no message or documentation explains the failure.

Add the appropriate error message.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

Massaged the text a bit.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-16 08:26:11 -07:00
Ming Lei 0a6219a95f block: deal with stale req count of plug list
In both the legacy and mq paths, the request count of the plug list is
computed before allocating the request, so the count can be stale when
allocation falls back to sleeping; the newly introduced wbt can sleep
too.

This patch handles the case by checking whether the plug list has become
empty, and fixes the KASAN report 'BUG: KASAN: stack-out-of-bounds'
that was introduced by Shaohua's patches for dispatching big requests.

Fixes: 600271d900002 ("blk-mq: immediately dispatch big size request")
Fixes: 50d24c34403c6 ("block: immediately dispatch big size request")
Cc: Shaohua Li <shli@fb.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-16 08:09:51 -07:00
Omar Sandoval 2868f13c30 scsi_lib: untangle 0 and BLK_MQ_RQ_QUEUE_OK
Let's not depend on any of the BLK_MQ_RQ_QUEUE_* constants having
specific values. No functional change.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-15 12:50:31 -07:00
Omar Sandoval bac0000af5 nvme: untangle 0 and BLK_MQ_RQ_QUEUE_OK
Let's not depend on any of the BLK_MQ_RQ_QUEUE_* constants having
specific values. No functional change.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-15 12:50:11 -07:00
Omar Sandoval b4a567e811 loop: return proper error from loop_queue_rq()
->queue_rq() should return one of the BLK_MQ_RQ_QUEUE_* constants, not
an errno.
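The rule in this commit, together with the two "untangle" commits above, can be sketched as follows. The enum values are hypothetical stand-ins for the BLK_MQ_RQ_QUEUE_* constants and are deliberately nonzero so that nothing relies on OK being 0; hypo_prepare() is a hypothetical internal step that reports failure as a negative errno.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for BLK_MQ_RQ_QUEUE_*; values are arbitrary. */
enum hypo_rq_status {
	HYPO_RQ_QUEUE_OK = 10,
	HYPO_RQ_QUEUE_BUSY,
	HYPO_RQ_QUEUE_ERROR,
};

/* Hypothetical internal step that fails with a negative errno. */
static int hypo_prepare(int fail)
{
	return fail ? -EIO : 0;
}

/* ->queue_rq()-style wrapper: translate errno results into the status
 * constants instead of returning -EIO (or a bare 0) to the caller. */
static enum hypo_rq_status hypo_queue_rq(int fail)
{
	int ret = hypo_prepare(fail);

	if (ret < 0)
		return HYPO_RQ_QUEUE_ERROR;
	return HYPO_RQ_QUEUE_OK;
}
```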

Fixes: f4aa4c7bba ("block: loop: convert to per-device workqueue")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-14 15:58:44 -07:00
Damien Le Moal c6463c651d sd_zbc: Force use of READ16/WRITE16
Normally, sd_read_capacity sets sdp->use_16_for_rw to 1 based on the
disk capacity so that READ16/WRITE16 are used for large drives.
However, for a zoned disk with RC_BASIS set to 0, the capacity reported
through READ_CAPACITY may be very small, leading to use_16_for_rw not being
set and READ10/WRITE10 commands being used, even after the actual zoned disk
capacity is corrected in sd_zbc_read_zones. This causes LBA offset overflow for
accesses beyond 2TB.

As the ZBC standard makes it mandatory for ZBC drives to support
the READ16/WRITE16 commands anyway, make sure that use_16_for_rw is set.
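The 2TB limit follows from READ10/WRITE10 carrying a 32-bit LBA field: with 512-byte logical blocks, 2^32 x 512 bytes = 2 TiB. A minimal sketch of the command choice, assuming hypothetical helpers; the real decision is made in the sd driver via sdp->use_16_for_rw.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* READ(10)/WRITE(10) carry a 32-bit LBA and a 16-bit transfer length. */
static bool hypo_fits_rw10(uint64_t lba, uint32_t nr_blocks)
{
	return lba <= UINT32_MAX && nr_blocks <= UINT16_MAX;
}

/* Hypothetical choice: zoned (ZBC) drives always use the 16-byte
 * commands, since ZBC mandates READ16/WRITE16 support. */
static bool hypo_use_16(bool zoned, uint64_t lba, uint32_t nr_blocks)
{
	return zoned || !hypo_fits_rw10(lba, nr_blocks);
}
```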

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-14 13:16:42 -07:00
Bart Van Assche dbb3ab0356 bsg: Add sparse annotations to bsg_request_fn()
Keep sparse from complaining about unbalanced lock actions.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-14 09:57:03 -07:00
Jens Axboe 382cf633ed blk-wbt: use BLK_STAT_{READ,WRITE} instead of 0/1
Since we have proper enums for the stats directions, use them.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-11 16:18:29 -07:00
Jens Axboe 8054b89f8f blk-wbt: remove stat ops
Again a leftover from when the throttling code was generic. Now that we
just have the block user, get rid of the stat ops and indirections.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-11 16:18:24 -07:00
Jens Axboe d8a0cbfd73 blk-wbt: store queue instead of bdi
The bdi was a leftover from when the code was block layer agnostic.
Now that we just support a block layer user, store the queue directly.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-11 16:18:18 -07:00
Jens Axboe bbd7bb7017 block: move poll code to blk-mq
The poll code is blk-mq specific, let's move it to blk-mq.c. This
is a prep patch for improving the polling code.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-11-11 13:40:25 -07:00