To maintain write-order fidelity accros all volumes in a DRBD resource,
the receiver of a P_BARRIER needs to issue flushes to all volumes.
We used to do this by calling blkdev_issue_flush(), synchronously,
one volume at a time.
We now submit all flushes to all volumes in parallel, then wait for all
completions, to reduce worst-case latencies on multi-volume resources.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The command line parameter the kernel module uses to communicate the
device minor to userland helper is flawed in a way that the device
indentifier "minor-%d" is being truncated to minors with a maximum
of 5 digits.
But DRBD 8.4 allows 2^20 == 1048576 minors,
thus a minimum of 7 digits must be supported.
Reported by Veit Wahlich on drbd-dev.
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Regression introduced with 8.4.5
drbd: application writes may set-in-sync in protocol != C
Overwriting the same block (LBA) while a former version is still
"in-flight" to the peer (to be exact: we did not receive the
P_BARRIER_ACK for its epoch yet) would wait for the full epoch of that
former version to be acknowledged by the peer.
In synchronous and quasi-synchronous protocols C and B,
this may double the latency on overwrites.
With protocol A, which is supposed to be asynchronous and only wait for
local completion, it is even worse: it would make overwrites
quasi-synchronous, they would be hit by the full RTT, which protocol A
was specifically meant to avoid, and possibly the additional time it
takes to drain the buffers first.
Particularly bad for databases, or anything else that
does frequent updates to the same blocks (various file system meta data).
No impact if >= rtt passes between updates to the same block.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If thinly provisioned volumes are used, during a resync the sync source
tries to find out if a block is deallocated. If it is deallocated, then
the resync target uses block_dev_issue_zeroout() on the range in
question.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
As long as the value is 0 the feature is disabled. With setting
it to a positive value, DRBD limits and aligns its resync requests
to the rs-discard-granularity setting. If the sync source detects
all zeros in such a block, the resync target discards the range
on disk.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If during resync we read only zeroes for a range of sectors assume
that these secotors can be discarded on the sync target node.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When leaving resync states because of disconnect,
do the bitmap write-out synchronously in the drbd_disconnected() path.
When leaving resync states because we go back to AHEAD/BEHIND, or
because resync actually finished, or some disk was lost during resync,
trigger the write-out from after_state_ch().
The bitmap write-out for resync -> ahead/behind was missing completely before.
Note that this is all only an optimization to avoid double-resyncs of
already completed blocks in case this node crashes.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The intention was to only suspend IO if some normal bitmap operation is
supposed to be locked out, not always. If the bulk operation is flaged
as BM_LOCKED_CHANGE_ALLOWED, we do not need to suspend IO.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We want to apply this to Fabrics drivers as well, so move it to common
code.
Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Centralize the check if a given NVMe command reads or writes data.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Some transport drivers may have a lower transfer size than
the controller. So allow the transport to set it in the
controller max_hw_sectors.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Add get_log_page command structure and a corresponding entry in
nvme_command union
Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com>
Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed--by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
These have been added in NVMe 1.2 and we'll need at least oaes for the
NVMe target driver.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
alloc_workqueue replaces deprecated create_workqueue().
Dedicated workqueues have been used since bcache_wq and moving_gc_wq
are workqueues for writes and are being used on a memory reclaim path.
WQ_MEM_RECLAIM has been set to ensure forward progress under memory
pressure.
Since there are only a fixed number of work items, explicit concurrency
limit is unnecessary here.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Instead of overloading the discard support with the REQ_SECURE flag.
Use the opportunity to rename the queue flag as well, and remove the
dead checks for this flag in the RAID 1 and RAID 10 drivers that don't
claim support for secure erase.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Because we define WRITE/READ as REQ_OPs, we cannot do
switch (rq_data_dir(request))
case READ
....
case WRITE
...
without getting warnings about handling other REQ_OPs.
This just has mq_disk do a if/else like it does in other
places.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
In case of the active namespace list scanning method, a namespace that
is detached is not removed from the host if it was the last entry in
the list. Fix this by adding a scan to validate namespaces greater than
the value of prev.
This also handles the case of removing namespaces whose value exceed
the device's reported number of namespaces.
Signed-off-by: Sunad Bhandary S <sunad.s@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
It's more elegant to use UINT_MAX to represent the max value of
type unsigned int. So replace the actual value by using this define.
Signed-off-by: Minfei Huang <mnghuan@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
So it can be used by fabrics driver also.
Signed-off-by: Ming Lin <ming.l@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Keith Busch <keith.bsuch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
nvme_cancel_io is a bit confusing (given the distinction of io/admin),
so rename it to nvme_cancel_request.
And update it a bit to pass in struct nvme_ctrl, so it can be used
by Fabrics driver also.
Signed-off-by: Ming Lin <ming.l@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Keith Busch <keith.bsuch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The last patch added a REQ_OP_FLUSH for request_fn drivers
and the next patch renames REQ_FLUSH to REQ_PREFLUSH which
will be used by file systems and make_request_fn drivers so
they can send a write/flush combo.
This patch drops xen's use of REQ_FLUSH to track if it supports
REQ_OP_FLUSH requests, so REQ_FLUSH can be deleted.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Juergen Gross <kernel@pfupf.net>
Signed-off-by: Jens Axboe <axboe@fb.com>
This adds a REQ_OP_FLUSH operation that is sent to request_fn
based drivers by the block layer's flush code, instead of
sending requests with the request->cmd_flags REQ_FLUSH bit set.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch drops the compat definition of req_op where it matches
the rq_flag_bits definitions, and drops the related old and compat
code that allowed users to set either the op or flags for the operation.
We also then store the operation in the bi_rw/cmd_flags field similar
to how we used to store the bio ioprio where it sat in the upper bits
of the field.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We don't need bi_rw to be so large on 64 bit archs, so
reduce it to unsigned int.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
In the next patch, we move drop the compat code and make
the op a separate value that is hidden in bi_rw. To give
the op and rq bits flags room to grow this moves prio to
its own field.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The block layer will set the correct READ/WRITE operation flags/fields
when creating a request, so there is not need for drivers to set the
REQ_WRITE flag.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Have blktrace use the req/bio op accessor to get the REQ_OP.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The req operation REQ_OP is separated from the rq_flag_bits
definition. This converts the block layer drivers to
use req_op to get the op from the request struct.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch converts the is_sync helpers to use separate variables
for the operation and flags.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch converts the block layer merging code to use separate variables
for the operation and flags, and to check req_op for the REQ_OP.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The bio and request operation and flags are going to be separate
definitions, so we cannot pass them in as a bitmap. This patch
converts the blkg_rwstat code and its caller, cfq, to pass in the
values separately.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch converts the elevator code to use separate variables
for the operation and flags, and to check req_op for the REQ_OP.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch modifies the blk mq request creation code to use
separate variables for the operation and flags, because in the
the next patches the struct request users will be converted like
was done for bios.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch prepares *_get_request/*_put_request and freed_request,
to use separate variables for the operation and flags. In the
next patches the struct request users will be converted like
was done for bios where the op and flags are set separately.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The bio users should now always be setting up the bio op. This patch
has the block layer copy that to the request.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have xen
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have the target layer
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have md
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have drbd
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have bcache
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have dm
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
It looks like dm stats cares about the data direction
(READ vs WRITE) and does not need the bio/request flags.
Commands like REQ_FLUSH, REQ_DISCARD and REQ_WRITE_SAME
are currently always set with REQ_WRITE, so the extra check for
REQ_DISCARD in dm_stats_account_io is not needed.
This patch has it use the bio and request data_dir helpers
instead of accessing the bi_rw/cmd_flags directly. This makes
the next patches that remove the operation from the cmd_flags
and bi_rw easier, because we will no longer have the REQ_WRITE
bit set for operations like discards.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have the pm code
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have ocfs2
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>