Commit Graph

602025 Commits

Author SHA1 Message Date
Lars Ellenberg 20004e2435 drbd: bump current uuid when resuming IO with diskless peer
Scenario, starting with normal operation
 Connected Primary/Secondary UpToDate/UpToDate
 NetworkFailure Primary/Unknown UpToDate/DUnknown (frozen)
 ... more failures happen, secondary loses its disk,
 but eventually is able to re-establish the replication link ...
 Connected Primary/Secondary UpToDate/Diskless (resumed; needs to bump uuid!)

We used to just resume/resend suspended requests,
without bumping the UUID.

Which will lead to problems later, when we want to re-attach the disk on
the peer, without first disconnecting, or if we experience additional
failures, because we now have diverging data without being able to
recognize it.

Make sure we also bump the current data generation UUID,
if we notice "peer disk unknown" -> "peer disk known bad".
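
The transition rule above can be modelled in a few lines. This is only a sketch in Python, not the driver code: the `Device` class, the `KNOWN_BAD` set and the function names are hypothetical, and the random 64-bit value merely stands in for a new data generation UUID.

```python
import secrets
from dataclasses import dataclass, field

# Hypothetical, simplified state names for illustration only.
D_UNKNOWN = "DUnknown"
KNOWN_BAD = {"Diskless", "Failed"}  # assumption: what counts as "known bad"


@dataclass
class Device:
    current_uuid: int
    history: list = field(default_factory=list)


def new_current_uuid(dev):
    """Remember the old data generation and start a new random one."""
    dev.history.insert(0, dev.current_uuid)
    dev.current_uuid = secrets.randbits(64)


def on_peer_disk_change(dev, old_pdsk, new_pdsk):
    """Bump the UUID when "peer disk unknown" becomes "peer disk known bad"."""
    if old_pdsk == D_UNKNOWN and new_pdsk in KNOWN_BAD:
        new_current_uuid(dev)
        return True
    return False
```

With the bump in place, a later handshake sees the old generation in the history slot and can recognize that the data diverged.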

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:07 -06:00
Lars Ellenberg 31d646042d drbd: disallow promotion during resync handshake, avoid deadlock and hard reset
We already serialize connection state changes,
and other, non-connection state changes (role changes)
while we are establishing a connection.

But if we have an established connection,
then trigger a resync handshake (by primary --force or similar),
until now we just had to be "lucky".

Consider this sequence (e.g. deployment scenario):
create-md; up;
  -> Connected Secondary/Secondary Inconsistent/Inconsistent
then do a racy primary --force on both peers.

 block drbd0: drbd_sync_handshake:
 block drbd0: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
 block drbd0: peer 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
 block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Inconsistent )
 block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate )
  *** HERE things go wrong. ***
 block drbd0: role( Secondary -> Primary )
 block drbd0: drbd_sync_handshake:
 block drbd0: self 0000000000000005:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
 block drbd0: peer C90D2FC716D232AB:0000000000000004:0000000000000000:0000000000000000 bits:25590 flags:0
 block drbd0: Becoming sync target due to disk states.
 block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake.
 block drbd0: Remote failed to finish a request within 6007ms > ko-count (2) * timeout (30 * 0.1s)
 drbd s0: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )

The problem here is that the local promotion happens before the sync handshake
triggered by the remote promotion was completed.  Some assumptions elsewhere
become wrong, and when the expected resync handshake is then received and
processed, we get stuck in a deadlock, which can only be recovered by reboot :-(

Fix: if we know the peer has good data,
and our own disk is present, but NOT good,
and there is no resync going on yet,
we expect a sync handshake to happen "soon".
So reject a racy promotion with SS_IN_TRANSIENT_STATE.
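
As a rough illustration of the three conditions in the fix, here is a minimal Python model. The function name, the string state names and the `resync_running` flag are assumptions made for this sketch; the actual check in the driver operates on its own state structures.

```python
SS_SUCCESS = "SS_SUCCESS"
SS_IN_TRANSIENT_STATE = "SS_IN_TRANSIENT_STATE"


def check_promotion(peer_disk, local_disk, resync_running):
    """Reject promotion while a resync handshake is expected "soon"."""
    peer_has_good_data = peer_disk == "UpToDate"
    # Our disk is present (not Diskless) but NOT good (not UpToDate).
    local_present_not_good = local_disk not in ("Diskless", "UpToDate")
    if peer_has_good_data and local_present_not_good and not resync_running:
        return SS_IN_TRANSIENT_STATE
    return SS_SUCCESS
```

A caller that gets `SS_IN_TRANSIENT_STATE` would be expected to retry later, which matches the "postponed" promotion in the result log below.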

Result:
 ... as above ...
 block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate )
  *** local promotion being postponed until ... ***
 block drbd0: drbd_sync_handshake:
 block drbd0: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
 block drbd0: peer 77868BDA836E12A5:0000000000000004:0000000000000000:0000000000000000 bits:25590 flags:0
  ...
 block drbd0: conn( WFBitMapT -> WFSyncUUID )
 block drbd0: updated sync uuid 85D06D0E8887AD44:0000000000000000:0000000000000000:0000000000000000
 block drbd0: conn( WFSyncUUID -> SyncTarget )
  *** ... after the resync handshake ***
 block drbd0: role( Secondary -> Primary )

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:07 -06:00
Lars Ellenberg f2d3d75b66 drbd: sync_handshake: handle identical uuids with current (frozen) Primary
If, in a two-primary scenario, we lose our peer, freeze IO,
and are still frozen (no UUID rotation) when the peer comes back
as Secondary after a hard crash, we will see identical UUIDs.

The "rule_nr = 40" chose to use the "CRASHED_PRIMARY" bit as
arbitration, but that would cause the still running (but frozen) Primary
to become SyncTarget (which it typically refuses), and the handshake is
declined.

Fix: check current roles.
If we have *one* current primary, the Primary wins.
(rule_nr = 41)

Since that is a protocol change, use the newly introduced DRBD_FF_WSAME
to determine if rule_nr = 41 can be applied.
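
The arbitration described above might be sketched like this. It is a simplified Python model with hypothetical names and return conventions (1 for "we become SyncSource", -1 for "we become SyncTarget", `None` for "fall back to rule 40"); the feature check is reduced to one boolean.

```python
def arbitrate_identical_uuids(self_primary, peer_primary, both_support_wsame):
    """Return (rule_nr, result) for a handshake with identical UUIDs."""
    exactly_one_primary = self_primary != peer_primary
    if exactly_one_primary and both_support_wsame:
        # rule_nr = 41: the one current Primary wins and becomes SyncSource.
        return 41, (1 if self_primary else -1)
    # Otherwise keep the older CRASHED_PRIMARY based arbitration.
    return 40, None
```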

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:07 -06:00
Lars Ellenberg 9104d31a75 drbd: introduce WRITE_SAME support
We will support WRITE_SAME, if
 * all peers support WRITE_SAME (both in kernel and DRBD version),
 * all peer devices support WRITE_SAME
 * logical_block_size is identical on all peers.

We may at some point introduce a fallback on the receiving side
for devices/kernels that do not support WRITE_SAME,
by open-coding a submit loop. But not yet.
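
The three conditions listed above amount to a simple predicate. The following Python sketch uses a hypothetical `Peer` record to stand in for whatever the driver learns about each node during the handshake.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    protocol_supports_wsame: bool  # kernel and DRBD version support it
    device_supports_wsame: bool    # backing device supports it
    logical_block_size: int        # in bytes


def write_same_allowed(peers):
    """WRITE_SAME is only offered if every node agrees on all three points."""
    if not peers:
        return False
    block_sizes = {p.logical_block_size for p in peers}
    return (
        all(p.protocol_supports_wsame and p.device_supports_wsame for p in peers)
        and len(block_sizes) == 1
    )
```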

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:07 -06:00
Lars Ellenberg 60bac04012 drbd: report sizes if rejecting too small peer disk
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:06 -06:00
Lars Ellenberg 65f5be3579 drbd: discard_zeroes_if_aligned allows "thin" resync for discard_zeroes_data=0
Even if discard_zeroes_data == 0,
if discard_zeroes_if_aligned is set, we assume we can reliably
zero-out/discard using the drbd_issue_peer_discard() helper.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:06 -06:00
Lars Ellenberg af61494ad4 drbd: only restart frozen disk io when D_UP_TO_DATE
When re-attaching the local backend device to a C_STANDALONE D_DISKLESS
R_PRIMARY with OND_SUSPEND_IO, we may only resume IO if we recognize the
backend that is being attached as D_UP_TO_DATE.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:06 -06:00
Lars Ellenberg 0ead5cca3d drbd: if there is no good data accessible, writes should be IO errors
If DRBD lost all paths to good data,
and the on-no-data-accessible policy is OND_SUSPEND_IO,
all pending and new IO requests are suspended (will block).

If that setting is OND_IO_ERROR, IO will still be completed.
READ to "clean" areas (e.g. on a D_INCONSISTENT device,
and bitmap indicates a block is already in sync) will succeed.
READ to "unclean" areas (bitmap indicates block is out-of-sync),
will return EIO.

If we are already D_DISKLESS (or D_FAILED), we also return EIO.

Unfortunately, on a former R_PRIMARY C_SYNC_TARGET D_INCONSISTENT,
after replication link loss, new WRITE requests still went through OK.

They would also set the "out-of-sync" bit on their way, so READ after
WRITE would still return EIO. Also, as the data generation UUIDs had not
been bumped, we would cause data divergence without being able to
detect it on the next sync handshake, given the right sequence of events
in a multiple error scenario and "improper" order of recovery actions.

The right thing to do is to return EIO for all new writes,
unless we have access to good, current, D_UP_TO_DATE data.

The "established best practices" way to avoid these situations in the
first place is to set OND_SUSPEND_IO, or even do a hard-reset from
the pri-on-incon-degr policy helper hook.
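
The resulting request handling can be summarized as a small decision function. This is a Python sketch of the rules stated above, not the driver's request path; the function name, parameters and string return values are made up for illustration.

```python
def dispose(op, policy, good_data_reachable, local_disk, block_in_sync=False):
    """Decide what happens to a new request: "submit", "suspend" or "EIO"."""
    if good_data_reachable:
        return "submit"           # normal operation
    if policy == "OND_SUSPEND_IO":
        return "suspend"          # block until good data is reachable again
    # policy is OND_IO_ERROR from here on
    if local_disk in ("D_DISKLESS", "D_FAILED"):
        return "EIO"
    if op == "WRITE":
        return "EIO"              # the fix: no writes without good data
    # READ on e.g. a D_INCONSISTENT disk: only "clean" areas succeed
    return "submit" if block_in_sync else "EIO"
```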

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:06 -06:00
Lars Ellenberg 7bd000cb0c drbd: don't forget error completion when "unsuspending" IO
Possible sequence of events:
SyncTarget is made Primary, then loses replication link
(only path to good data on SyncSource).

Behavior is then controlled by the on-no-data-accessible policy,
which defaults to OND_IO_ERROR (may be set to OND_SUSPEND_IO).

If OND_IO_ERROR is in fact the current policy, we clear the susp_fen
(IO suspended due to fencing policy) flag, and do NOT set the susp_nod
(IO suspended due to no data) flag.

But we forgot to call the IO error completion for all pending,
suspended, requests.

While at it, also add a race check for a theoretically possible
race with a new handshake (network hiccup): we may be able to
re-send requests, and can avoid passing IO errors up the stack.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:06 -06:00
Lars Ellenberg 26a96110ab drbd: introduce unfence-peer handler
When resync is finished, we already call the "after-resync-target"
handler (on the former sync target, obviously), once per volume.

Paired with the before-resync-target handler, you can create snapshots,
before the resync causes the volumes to become inconsistent,
and discard those snapshots again, once they are no longer needed.

It was also overloaded to be paired with the "fence-peer" handler,
to "unfence" once the volumes are up-to-date and known good.

This has some disadvantages, though: we call "fence-peer" for the whole
connection (once for the group of volumes), but would call unfence as
side-effect of after-resync-target once for each volume.

Also, we fence on a (current, or about to become) Primary,
which will later become the sync-source.

Calling unfence only as a side effect of the after-resync-target
handler opens a race window, between a new fence on the Primary
(SyncTarget) and the unfence on the SyncTarget, which is difficult to
close without some kind of "cluster wide lock" in those handlers.

We would not need those handlers if we could still communicate.
Which makes trying to acquire a cluster wide lock from those handlers
seem like a very bad idea.

This introduces the "unfence-peer" handler, which will be called
per connection (once for the group of volumes), just like the fence
handler, only once all volumes are back in sync, and on the SyncSource.

Which is expected to be the node that previously called "fence", the
node that is currently allowed to be Primary, and thus the only node
that could trigger a new "fence" that could race with this unfence.

Which makes us not need any cluster wide synchronization here,
serializing two scripts running on the same node is trivial.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:06 -06:00
Lars Ellenberg 5052fee2c7 drbd: finish resync on sync source only by notification from sync target
If the replication link breaks exactly during "resync finished" detection,
finishing too early on the sync source could again lead to UUIDs rotated
too fast, and potentially a spurious full resync on next handshake.

Always wait for explicit resync finished state change notification from
the sync target.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:05 -06:00
Lars Ellenberg 505675f96c drbd: allow larger max_discard_sectors
Make sure we have at least 67 (> AL_UPDATES_PER_TRANSACTION)
al-extents available, and allow up to half of that to be
discarded in one bio.
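
To give a feel for the numbers, here is a back-of-the-envelope calculation in Python. It assumes an activity-log extent covers 4 MiB and that sectors are 512 bytes; both constants, and the function itself, are assumptions of this sketch rather than values taken from the commit.

```python
SECTOR_SIZE = 512
AL_EXTENT_SIZE = 4 * 1024 * 1024  # assumption: 4 MiB per al-extent
MIN_AL_EXTENTS = 67               # lower bound named in the commit message


def max_discard_sectors(al_extents):
    """Largest discard, in sectors, that covers at most half the al-extents."""
    usable = max(al_extents, MIN_AL_EXTENTS)
    return (usable // 2) * AL_EXTENT_SIZE // SECTOR_SIZE
```

Under these assumptions, the minimum of 67 al-extents allows 33 extents, or 132 MiB, per discard bio.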

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:05 -06:00
Lars Ellenberg 7435e9018f drbd: zero-out partial unaligned discards on local backend
For consistency, also zero-out partial unaligned chunks of discard
requests on the local backend.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:05 -06:00
Lars Ellenberg 69ba1ee936 drbd: possibly disable discard support, if backend has discard_zeroes_data=0
Now that we have the discard_zeroes_if_aligned setting, we should also
check it when setting up our queue parameters on the primary,
not only on the receiving side.

We announce discard support,
UNLESS

 * we are connected to a peer that does not support TRIM
   on the DRBD protocol level.  Otherwise, it would either discard, or
   do a fallback to zero-out, depending on its backend and configuration.

 * our local backend does not support discards,
   or (discard_zeroes_data=0 AND discard_zeroes_if_aligned=no).
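
The two exceptions above can be read as a predicate. A minimal Python sketch, with hypothetical parameter names:

```python
def announce_discard(connected, peer_supports_trim, backend_supports_discard,
                     discard_zeroes_data, discard_zeroes_if_aligned):
    """Return True if the Primary should announce discard support."""
    # Exception 1: a connected peer that lacks TRIM on the protocol level.
    if connected and not peer_supports_trim:
        return False
    # Exception 2: the local backend cannot discard at all ...
    if not backend_supports_discard:
        return False
    # ... or cannot be trusted to zero, and the workaround is switched off.
    if not discard_zeroes_data and not discard_zeroes_if_aligned:
        return False
    return True
```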

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:05 -06:00
Lars Ellenberg dd4f699da6 drbd: when receiving P_TRIM, zero-out partial unaligned chunks
We can avoid spurious data divergence caused by partially-ignored
discards on certain backends with discard_zeroes_data=0, if we
translate partial unaligned discard requests into explicit zero-out.

The relevant use case is LVM/DM thin.

If on different nodes, DRBD is backed by devices with differing
discard characteristics, discards may lead to data divergence
(old data or garbage left over on one backend, zeroes due to
unmapped areas on the other backend). Online verify would now
potentially report tons of spurious differences.

While probably harmless for most use cases (fstrim on a file system),
DRBD cannot have that, it would violate our promise to upper layers
that our data instances on the nodes are identical.

To be correct and play safe (make sure data is identical on both copies),
we would have to disable discard support, if our local backend (on a
Primary) does not support "discard_zeroes_data=true".

We'd also have to translate discards to explicit zero-out on the
receiving (typically: Secondary) side, unless the receiving side
supports "discard_zeroes_data=true".

Which both would allocate those blocks, instead of unmapping them,
in contrast with expectations.

LVM/DM thin does set discard_zeroes_data=0,
because it silently ignores discards to partial chunks.

We can work around this by checking the alignment first.
For unaligned (wrt. alignment and granularity) or too small discards,
we zero-out the initial (and/or) trailing unaligned partial chunks,
but discard all the aligned full chunks.
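
The head/body/tail split might look like the following Python sketch. All values are in sectors; the function name, the tuple format and the treatment of `alignment` as an offset of the first chunk boundary are assumptions made for illustration.

```python
def split_discard(start, nr_sectors, granularity, alignment=0):
    """Split a discard into zero-out and discard pieces.

    Returns a list of (operation, start, length) tuples.
    """
    if nr_sectors <= 0:
        return []
    end = start + nr_sectors
    # First chunk boundary at or after start, last one at or before end.
    first = start + (alignment - start) % granularity
    last = end - (end - alignment) % granularity
    if first >= last:
        # Too small or no full chunk inside: zero-out everything.
        return [("zeroout", start, nr_sectors)]
    ops = []
    if first > start:
        ops.append(("zeroout", start, first - start))  # unaligned head
    ops.append(("discard", first, last - first))       # aligned full chunks
    if end > last:
        ops.append(("zeroout", last, end - last))      # unaligned tail
    return ops
```

For example, `split_discard(10, 100, 32)` zeroes sectors 10..31, discards 32..95 and zeroes 96..109, so every sector in the range reads back as zero afterwards.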

At least for LVM/DM thin, the result is effectively "discard_zeroes_data=1".

Arguably it should behave this way internally, by default,
and we'll try to make that happen.

But our workaround is still valid for already deployed setups,
and for other devices that may behave this way.

Setting discard-zeroes-if-aligned=yes will allow DRBD to use
discards, and to announce discard_zeroes_data=true, even on
backends that announce discard_zeroes_data=false.

Setting discard-zeroes-if-aligned=no will cause DRBD to always
fall-back to zero-out on the receiving side, and to not even
announce discard capabilities on the Primary, if the respective
backend announces discard_zeroes_data=false.

We used to ignore the discard_zeroes_data setting completely.
To not break established and expected behaviour, and suddenly
cause fstrim on thin-provisioned LVs to run out-of-space,
instead of freeing up space, the default value is "yes".

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:05 -06:00
Lars Ellenberg f9ff0da564 drbd: allow parallel flushes for multi-volume resources
To maintain write-order fidelity across all volumes in a DRBD resource,
the receiver of a P_BARRIER needs to issue flushes to all volumes.
We used to do this by calling blkdev_issue_flush(), synchronously,
one volume at a time.

We now submit all flushes to all volumes in parallel, then wait for all
completions, to reduce worst-case latencies on multi-volume resources.
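
The "submit everything, then wait" pattern can be illustrated with threads in Python. This is an analogy only: the kernel submits asynchronous flush bios rather than spawning threads, and `flush_one` is a hypothetical callback standing in for the per-volume flush.

```python
from concurrent.futures import ThreadPoolExecutor


def flush_all(volumes, flush_one):
    """Start a flush on every volume, then collect all completions."""
    if not volumes:
        return {}
    with ThreadPoolExecutor(max_workers=len(volumes)) as pool:
        # Submit phase: nothing waits here.
        futures = {vol: pool.submit(flush_one, vol) for vol in volumes}
        # Wait phase: total latency is roughly the slowest single flush.
        return {vol: fut.result() for vol, fut in futures.items()}
```

With the old one-at-a-time approach, the worst case is the sum of all flush latencies instead of the maximum.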

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:05 -06:00
Lars Ellenberg 0982368bfd drbd: fix for truncated minor number in callback command line
The command line parameter the kernel module uses to communicate the
device minor to the userland helper is flawed: the device
identifier "minor-%d" is truncated to minors with a maximum
of 5 digits.

But DRBD 8.4 allows 2^20 == 1048576 minors,
thus a minimum of 7 digits must be supported.
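
The arithmetic is easy to check. The Python sketch below computes how large a C buffer for the "minor-%d" string would have to be, including the terminating NUL byte; the helper names are hypothetical.

```python
MAX_MINORS = 1 << 20  # 2^20 == 1048576 minors, numbered 0..1048575


def minor_arg(minor):
    """Format the identifier passed to the userland helper."""
    if not 0 <= minor < MAX_MINORS:
        raise ValueError("minor out of range")
    return f"minor-{minor}"


def required_buffer_size():
    """Bytes a C buffer needs for the longest identifier plus NUL."""
    return len(minor_arg(MAX_MINORS - 1)) + 1
```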

Reported by Veit Wahlich on drbd-dev.

Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:04 -06:00
Lars Ellenberg 1b228c98ce drbd: fix regression: protocol A sometimes synchronous, C sometimes double-latency
Regression introduced with 8.4.5
 drbd: application writes may set-in-sync in protocol != C

Overwriting the same block (LBA) while a former version is still
"in-flight" to the peer (to be exact: we did not receive the
P_BARRIER_ACK for its epoch yet) would wait for the full epoch of that
former version to be acknowledged by the peer.

In synchronous and quasi-synchronous protocols C and B,
this may double the latency on overwrites.

With protocol A, which is supposed to be asynchronous and only wait for
local completion, it is even worse: it would make overwrites
quasi-synchronous, they would be hit by the full RTT, which protocol A
was specifically meant to avoid, and possibly the additional time it
takes to drain the buffers first.

Particularly bad for databases, or anything else that
does frequent updates to the same blocks (various file system meta data).

No impact if >= rtt passes between updates to the same block.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:04 -06:00
Lars Ellenberg bca1cbaeac drbd: adjust assert in w_bitmap_io to account for BM_LOCKED_CHANGE_ALLOWED
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:04 -06:00
Philipp Reisner 92d94ae66a drbd: Create the protocol feature THIN_RESYNC
If thinly provisioned volumes are used, during a resync the sync source
tries to find out if a block is deallocated. If it is deallocated, then
the resync target uses blkdev_issue_zeroout() on the range in
question.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:04 -06:00
Philipp Reisner a5ca66c419 drbd: Introduce new disk config option rs-discard-granularity
As long as the value is 0 the feature is disabled. With setting
it to a positive value, DRBD limits and aligns its resync requests
to the rs-discard-granularity setting. If the sync source detects
all zeros in such a block, the resync target discards the range
on disk.
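
A toy model of the idea, in Python: split the device into granularity-sized, aligned chunks and decide per chunk whether the target should write the data or discard the range. The function and the tuple format are hypothetical, and treating a short trailing chunk as a plain write is a simplification of this sketch.

```python
def resync_plan(device, granularity):
    """Return (action, offset, length) per resync request over `device` bytes."""
    if granularity <= 0:
        # Feature disabled: no discard detection at all.
        return [("write", 0, len(device))] if device else []
    plan = []
    for offset in range(0, len(device), granularity):
        chunk = device[offset:offset + granularity]
        all_zero = len(chunk) == granularity and not any(chunk)
        plan.append(("discard" if all_zero else "write", offset, len(chunk)))
    return plan
```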

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:04 -06:00
Philipp Reisner 700ca8c04a drbd: Implement handling of thinly provisioned storage on resync target nodes
If during resync we read only zeroes for a range of sectors, assume
that these sectors can be discarded on the sync target node.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:04 -06:00
Philipp Reisner c5c2385481 drbd: Kill code duplication
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:03 -06:00
Lars Ellenberg be115b69f1 drbd: change bitmap write-out when leaving resync states
When leaving resync states because of disconnect,
do the bitmap write-out synchronously in the drbd_disconnected() path.

When leaving resync states because we go back to AHEAD/BEHIND, or
because resync actually finished, or some disk was lost during resync,
trigger the write-out from after_state_ch().

The bitmap write-out for resync -> ahead/behind was missing completely before.

Note that this is all only an optimization to avoid double-resyncs of
already completed blocks in case this node crashes.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:03 -06:00
Lars Ellenberg c0065f98d5 drbd: bitmap bulk IO: do not always suspend IO
The intention was to only suspend IO if some normal bitmap operation is
supposed to be locked out, not always. If the bulk operation is flagged
as BM_LOCKED_CHANGE_ALLOWED, we do not need to suspend IO.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:03 -06:00
Christoph Hellwig f5fa90dc0a nvme: move the workaround for I/O queue-less controllers from PCIe to core
We want to apply this to Fabrics drivers as well, so move it to common
code.

Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-12 07:29:43 -06:00
Christoph Hellwig 7a5abb4b48 nvme: factor out a nvme_is_write helper
Centralize the check if a given NVMe command reads or writes data.
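
Such a check can rely on the NVMe opcode encoding, where the lowest opcode bit indicates a host-to-controller data transfer. The Python sketch below models that test; the three opcode constants are the standard NVM command set values, while the Python function is only an illustration of the helper, not its source.

```python
NVME_CMD_FLUSH = 0x00
NVME_CMD_WRITE = 0x01
NVME_CMD_READ = 0x02


def nvme_is_write(opcode):
    """True if the command transfers data from the host to the controller."""
    return bool(opcode & 1)
```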

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-12 07:29:43 -06:00
Christoph Hellwig a229dbf61e nvme: allow for size limitations from transport drivers
Some transport drivers may have a lower transfer size than
the controller. So allow the transport to set it in the
controller max_hw_sectors.
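
One way to combine the two limits is to take the smaller non-zero value. The following Python sketch assumes the controller limit is derived from the MDTS field as a power of two in units of a 4 KiB page, and that 0 means "no limit"; the function names and the default `page_shift` are assumptions for illustration.

```python
def min_not_zero(a, b):
    """Smaller of two limits, where 0 means "no limit"."""
    if a == 0:
        return b
    if b == 0:
        return a
    return min(a, b)


def effective_max_hw_sectors(transport_limit, mdts, page_shift=12):
    """Combine a transport-imposed limit with the controller's MDTS."""
    ctrl_limit = 1 << (mdts + page_shift - 9) if mdts else 0
    return min_not_zero(transport_limit, ctrl_limit)
```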

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-12 07:29:43 -06:00
James Smart 3972be23bd nvme.h: add constants for PSDT and FUSE values
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-12 07:29:43 -06:00
Christoph Hellwig 79f370eac6 nvme.h: add AER constants
Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-12 07:29:43 -06:00
Christoph Hellwig 69cd27e251 nvme.h: add NVM command set SQE/CQE size defines
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-12 07:29:43 -06:00
Armen Baloyan 725b358836 nvme.h: Add get_log_page command structure
Add get_log_page command structure and a corresponding entry in
nvme_command union

Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com>
Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-12 07:29:43 -06:00
Christoph Hellwig 14e974a84e nvme.h: add RTD3R, RTD3E and OAES fields
These have been added in NVMe 1.2 and we'll need at least oaes for the
NVMe target driver.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-12 07:29:43 -06:00
Bhaktipriya Shridhar 81baf90af2 bcache: Remove deprecated create_workqueue
alloc_workqueue replaces deprecated create_workqueue().

Dedicated workqueues have been used since bcache_wq and moving_gc_wq
are workqueues for writes and are being used on a memory reclaim path.
WQ_MEM_RECLAIM has been set to ensure forward progress under memory
pressure.
Since there are only a fixed number of work items, explicit concurrency
limit is unnecessary here.

Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-11 20:03:04 -06:00
Christoph Hellwig 288dab8a35 block: add a separate operation type for secure erase
Instead of overloading the discard support with the REQ_SECURE flag.
Use the opportunity to rename the queue flag as well, and remove the
dead checks for this flag in the RAID 1 and RAID 10 drivers that don't
claim support for secure erase.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-09 09:52:25 -06:00
Mike Christie 56332f02a5 mg_disk: fix enum REQ_OP_ kbuild error
Because we define WRITE/READ as REQ_OPs, we cannot do

	switch (rq_data_dir(request)) {
	case READ:
		...
	case WRITE:
		...
	}

without getting warnings about handling other REQ_OPs.

This just has mg_disk do an if/else like it does in other
places.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-08 15:01:16 -06:00
Sunad Bhandary 47b0e50ac7 NVMe: Fix removal in case of active namespace list scanning method
In case of the active namespace list scanning method, a namespace that
is detached is not removed from the host if it was the last entry in
the list. Fix this by adding a scan to validate namespaces greater than
the value of prev.

This also handles the case of removing namespaces whose value exceeds
the device's reported number of namespaces.
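
The effect of the extra scan can be shown with sets. This Python sketch models the host's view as a set of namespace IDs and assumes the active list arrives in ascending order; `rescan` and its arguments are hypothetical names, not the driver's.

```python
def rescan(host_nsids, active_list):
    """Return the host's namespace IDs after scanning the active list."""
    host = set(host_nsids)
    prev = 0
    for nsid in active_list:
        # Anything between two reported IDs is no longer active.
        host -= {n for n in host if prev < n < nsid}
        host.add(nsid)
        prev = nsid
    # The fix: also drop everything above the last reported ID.
    host -= {n for n in host if n > prev}
    return host
```

Without the final step, a detached namespace 3 would survive a scan that only reports `[1, 2]`.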

Signed-off-by: Sunad Bhandary S <sunad.s@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-08 08:48:47 -06:00
Minfei Huang bd0fc2884c nvme: use UINT_MAX for max discard sectors
It's more elegant to use UINT_MAX to represent the max value of
type unsigned int. So replace the actual value by using this define.

Signed-off-by: Minfei Huang <mnghuan@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 22:23:12 -06:00
Ming Lin c55a2fd4bb nvme: move nvme_cancel_request() to common code
So it can be used by fabrics driver also.

Signed-off-by: Ming Lin <ming.l@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:43:02 -06:00
Ming Lin e1958e6534 nvme: update and rename nvme_cancel_io to nvme_cancel_request
nvme_cancel_io is a bit confusing (given the distinction of io/admin),
so rename it to nvme_cancel_request.

And update it a bit to pass in struct nvme_ctrl, so it can be used
by Fabrics driver also.

Signed-off-by: Ming Lin <ming.l@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:43:02 -06:00
Mike Christie 28a8f0d317 block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH
To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie a418090aa8 block: do not use REQ_FLUSH for tracking flush support
The last patch added a REQ_OP_FLUSH for request_fn drivers
and the next patch renames REQ_FLUSH to REQ_PREFLUSH which
will be used by file systems and make_request_fn drivers so
they can send a write/flush combo.

This patch drops xen's use of REQ_FLUSH to track if it supports
REQ_OP_FLUSH requests, so REQ_FLUSH can be deleted.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Juergen Gross <kernel@pfupf.net>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie 3a5e02ced1 block, drivers: add REQ_OP_FLUSH operation
This adds a REQ_OP_FLUSH operation that is sent to request_fn
based drivers by the block layer's flush code, instead of
sending requests with the request->cmd_flags REQ_FLUSH bit set.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie 4e1b2d52a8 block, fs, drivers: remove REQ_OP compat defs and related code
This patch drops the compat definition of req_op where it matches
the rq_flag_bits definitions, and drops the related old and compat
code that allowed users to set either the op or flags for the operation.

We also then store the operation in the bi_rw/cmd_flags field similar
to how we used to store the bio ioprio where it sat in the upper bits
of the field.
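
Packing the operation into the upper bits of a flags word could look like this. The Python sketch assumes a 32-bit field with 3 bits reserved for the operation; both widths, and the function names, are illustrative assumptions rather than the kernel's definitions.

```python
BI_RW_BITS = 32   # assumption: width of the combined field
REQ_OP_BITS = 3   # assumption: bits reserved for the operation
OP_SHIFT = BI_RW_BITS - REQ_OP_BITS
FLAG_MASK = (1 << OP_SHIFT) - 1


def set_op_attrs(op, flags):
    """Store the operation in the top bits and the flags below it."""
    if not 0 <= op < (1 << REQ_OP_BITS):
        raise ValueError("operation does not fit")
    return (op << OP_SHIFT) | (flags & FLAG_MASK)


def bio_op(bi_rw):
    """Extract the operation from the combined field."""
    return bi_rw >> OP_SHIFT


def bio_flags(bi_rw):
    """Extract the flag bits from the combined field."""
    return bi_rw & FLAG_MASK
```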

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie 6296b9604f block, drivers, fs: shrink bi_rw from long to int
We don't need bi_rw to be so large on 64 bit archs, so
reduce it to unsigned int.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie 43b62ce3ff block: move bio io prio to a new field
In the next patch, we drop the compat code and make
the op a separate value that is hidden in bi_rw. To give
the op and rq flag bits room to grow, this moves prio to
its own field.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie 8e45c6f880 ide cd: do not set REQ_WRITE on requests.
The block layer will set the correct READ/WRITE operation flags/fields
when creating a request, so there is no need for drivers to set the
REQ_WRITE flag.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie 1b9a9ab78b blktrace: use op accessors
Have blktrace use the req/bio op accessor to get the REQ_OP.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie c2df40dfb8 drivers: use req op accessor
The req operation REQ_OP is separated from the rq_flag_bits
definition. This converts the block layer drivers to
use req_op to get the op from the request struct.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie d9d8c5c489 block: convert is_sync helpers to use REQ_OPs.
This patch converts the is_sync helpers to use separate variables
for the operation and flags.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00