drbd_set_role(, R_PRIMARY, ) does the state change to Primary,
some more housekeeping, and possibly generates a new UUID set.
All of this holding the "state_mutex".
The connection handshake involves sending of various state information,
including the current data generation UUID set, and two connection
state changes from C_WF_CONNECTION to C_WF_REPORT_PARAMS further to
a number of different outcomes, resync being one of them.
If the connection handshake happens between the state change to Primary
and the generation of the new UUIDs, the resync decision based on the
old UUID set may be confused, depending on circumstances.
Make sure that, before we do the handshake, any promotion to Primary
role will either be complete (including the housekeeping stuff), or can
see, and serialize with, the ongoing handshake, based on the
"STATE_SENT" bit, which is set when we start the handshake, and cleared
only when we leave C_WF_REPORT_PARAMS again.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
The 8.3.12 commit drbd: Bugfix for the connection behavior fixes a
"wasted established connection", if a former connection attempt failed
during its early stages.
However it opened a window for a regression, if a connection attempt
fails during its last stages. The result was a terminated receiver
thread, that left behind the supposedly transient "C_UNCONNECTED" state.
Any later requests to change the connection state fail, as they wait for
the connection state to "stabilize".
Fix: short circuit and keep retrying to restablish a new connection,
if we don't reach C_WF_REPORT_PARAMS.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jing Wang <windsdaemon@gmail.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Without this, the meta-data gets updates after 5 seconds by the
md_sync_timer. Better to do it immeditaly after a state change.
If the asender detects a network failure, it may take a bit until
the worker processes the according after-conn-state-change work item.
The worker might be blocked in sending something, i.e. it
takes until it gets into its timeout. That is 6 seconds by
default which is longer than the 5 seconds of the md_sync_timer.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
drbd_disconnected() is supposed to clear the resync lru cache,
by calling drbd_rs_cancel_all().
We must do so before we call drbd_flush_workqueue(), as at least the
callback w_restart_disk_io() may wait for resync progres, and would
otherwise deadlock.
drbd_finish_peer_reqs() may again populate that cache, which will
then potentially be stale after the next resync handshake and bitmap
exchange, we have to do it again after that.
A stale resync lru cache causes no harm but ugly messages like this:
BAD! sector=196608s enr=6 rs_left=-256 rs_failed=0 count=256 cstate=SyncTarget
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Disconnecting is a cluster wide state change. In case the peer node agrees
to the state transition, it sends back the fact on the meta-data connection
and closes both sockets.
In case the node node that initiated the state transfer sees the closing
action on the data-socket, before the P_STATE_CHG_REPLY packet, it was
going into one of the network failure states.
At least with the fencing option set to something else thatn "dont-care",
the unclean shutdown of the connection causes a short IO freeze or
a fence operation.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
There is at least the worker context, the receiver context, the context of
receiving netlink packts.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
We don't discard anything here, really.
We resolve conflicting, concurrent writes to overlapping data blocks.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
To avoid confusion with REQ_DISCARD aka TRIM, rename our
"discard concurrent write acks" from P_DISCARD_WRITE to P_SUPERSEDED.
At the same time, rename the drbd request event DISCARD_WRITE
to CONFLICT_RESOLVED. It already triggers both successful completion
or restart of the request, depending on our RQ_POSTPONED flag.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
The DISCARD_CONCURRENT flag should be set on one node and cleared on the
other node.
As the code was before it was theoretical possible that a node accepts the
meta socket, but has to close it later on, and keeps the DISCARD_CONCURRENT
flag.
Correct this by moving the clear_bit(DISCARD_CONCURRENT) where the packet
gets sent.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Since the listening socket is open all the time, it was possible to
get into stable "initial packet S crossed" loops.
* when both sides realize in the drbd_socket_okay() call at the end
of the loop that the other side closed the main socket you had
the chance to get into a stable loop with repeated "packet S crossed"
messages.
* when both sides do not realize with the drbd_socket_okay() call at the end
of the loop that the other side closed the main socket you had
the chance to get into a stable loop with alternating "packet S crossed"
"packet M crossed" messages.
In order to break out these stable loops randomize the behaviour if
such a crossing of P_INITIAL_DATA or P_INITIAL_META packets is detected.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Since now our listening socket is open all the time we will get
connection tries of the peer always in. No need to try it three
times.
This is valid when connecting to older peers as well, it simply
increases the probability that the new version DRBD will accept
a connection instead that it will establish one.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Since the drbd_socket_okay() function itself tests if the the
socket is NULL, the explicit test "if (sock.socket && &msock.socket)"
was redundent.
Apart from that the address opperator ('&') before msock.socket rendered
the test pointless.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
In 8.4, we may have bios spanning two activity log extents.
Fixup drbd_al_begin_io() and drbd_al_complete_io() to deal with zero sized bios.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
We now can schedule only a specific range of sectors for online verify,
or interrupt a running verify without interrupting the connection.
Had to bump the protocol version differently, we are now 101.
Added verify_can_do_stop_sector() { protocol >= 97 && protocol != 100; }
Also, the return value convention for worker callbacks has changed,
we returned "true/false" for "keep the connection up" in 8.3,
we return 0 for success and <= for failure in 8.4.
Affected: receive_state()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
cherry-picked and adapted from drbd 9 devel branch
DRBD requests (struct drbd_request) are already on the per resource
transfer log list, and carry their epoch number. We do not need to
additionally link them on other ring lists in other structs.
The drbd sender thread can recognize itself when to send a P_BARRIER,
by tracking the currently processed epoch, and how many writes
have been processed for that epoch.
If the epoch of the request to be processed does not match the currently
processed epoch, any writes have been processed in it, a P_BARRIER for
this last processed epoch is send out first.
The new epoch then becomes the currently processed epoch.
To not get stuck in drbd_al_begin_io() waiting for P_BARRIER_ACK,
the sender thread also needs to handle the case when the current
epoch was closed already, but no new requests are queued yet,
and send out P_BARRIER as soon as possible.
This is done by comparing the per resource "current transfer log epoch"
(tconn->current_tle_nr) with the per connection "currently processed
epoch number" (tconn->send.current_epoch_nr), while waiting for
new requests to be processed in wait_for_work().
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
cherry-picked and adapted from drbd 9 devel branch
In 8.4, we don't distinguish between "resource work" and "connection
work" yet, we have one worker for both, as we still have only one connection.
We only ever used the "data.work",
no need to keep the "meta.work" around.
Move tconn->data.work to tconn->sender_work.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
This could be exploited by a peer which runs modified code.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Changes to the role and disk state should be delayed or rejected
while we establish a connection.
This is necessary, since the peer will base its resync decision
on the UUIDs and the state we sent in the drbd_connect() function.
The most prominent example for this race is becoming primary after
sending state and UUIDs and before the state changes to C_WF_CONNECTION.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Since drbd_bump_write_ordering() is called in the attaching
process while the disk state is D_ATTACHING, it was not
considering these three flags during attach.
A call to this function was missing form drbd_adm_disk_opts().
Fixed both issues.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Transfer log epochs, and therefore P_BARRIER packets,
are per resource, not per volume.
We must not associate them with "some random volume".
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
w_restart_write(), run from worker context, calls __drbd_make_request()
and further drbd_al_begin_io(, delegate=true), which then
potentially deadlocks. The previous patch moved a BUG_ON to expose
such call paths, which would now be triggered.
Also, if we call __drbd_make_request() from resource worker context,
like w_restart_write() did, and that should block for whatever reason
(!drbd_state_is_stable(), resource suspended, ...),
we potentially deadlock the whole resource, as the worker
is needed for state changes and other things.
Create a dedicated retry workqueue for this instead.
Also make sure that inc_ap_bio()/dec_ap_bio() are properly paired,
even if do_retry() needs to retry itself,
in case __drbd_make_request() returns != 0.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
drbd_accept was modelled after kernel_accept
with drbd commit 53eb779 in July 2008.
Only, kernel_accept was then broken, and only fixed later
with kernel commit 1b08534e in Dec 2008:
net: Fix module refcount leak in kernel_accept()
Impact: protocol families provided as modules, e.g. ipv6 or ib_sdp,
would soon have their reference count become negative, preventing
them from being unloaded (likely), or worse, hit zero without actually
being unused, allowing them to be unloaded while still in use (unlikely,
but if triggered, causing a kernel crash).
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
For compatibility reasons 8.4 has to send P_STATE_CHG_REQ (instead
of P_CONN_ST_CHG_REQ) when disconnecting.
In the receiving code path we missed to convert the old
answer (P_STATE_CHG_REPLY) back to 8.4 logic. Therefore
the CL_ST_CHG_SUCCESS or CL_ST_CHG_FAIL bit in the flags word
of mdev got set, while the state code was waiting for
the CONN_WD_ST_CHG_OKAY or CONN_WD_ST_CHG_FAIL bits in tconn.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
With commit from Mon Mar 28 16:33:12 2011 +0200
"drbd: drbd_connect(): Initialize struct drbd_socket before sending anything"
tconn->data.sock and tconn->meta.sock get assigned early, in
conn_connect.
The early assigning can trigger an OOPS, because it may released the socket
without acquiring the mutex protecting the socket. An other thread (worker)
might use setsockopt() on the socket while it gets free()ed.
Restored the (proven) 8.3 behavior of assigning these sockets after the two
connections are established.
Credits for reporting the issue are going to Arne Redlich.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
If the asender thread, or request_timer_fn(), or some other part of
the code, decided to drop the connection (because of timeout or other),
but the receiver just now was processing a P_STATE packet, there was a
chance that receive_state() would do a hard state change
"re-establishing" an already failed connection without additional handshake.
Log excerpt:
Remote failed to finish a request within ko-count * timeout
peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
asender terminated
...
peer( Unknown -> Secondary ) conn( Timeout -> Connected ) pdsk( DUnknown -> UpToDate ) peer_isp( 0 -> 1 )
...
Connection closed
peer( Secondary -> Unknown ) conn( Connected -> Unconnected ) pdsk( UpToDate -> DUnknown ) peer_isp( 1 -> 0 )
receiver terminated
Impact:
while the connection state is erroneously "Connected",
requests may be queued and even sent,
which would never be acknowledged,
and may have been missed by the cleanup.
These requests would never be completed.
The next drbd_suspend_io() will then lock up,
waiting forever for these requests to complete.
Fixed in several code paths:
Make sure the connection state is NetworkFailure or worse
before starting the cleanup in drbd_disconnect().
This should make sure the cleanup won't miss any requests.
Disallow receive_state() to "upgrade" the connection state
from an error state. This will make sure the "illegal" state
transition won't happen.
For all connection failure states,
relax the safe-guard in sanitize_state() again
to silently mask out those state changes
(e.g. Timeout -> Connected becomes Timeout -> Timeout).
Note by Philipp Reisner:
The 3rd chunk described as "relax the safe-guard..."
is not there in 8.4 as it is relaxed to the maximum in
8.4 already
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
New config option for the disk secition "read-balancing", with
the values: prefer-local, prefer-remote, round-robin, when-congested-remote.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Recent commit
drbd: Move write_ordering from mdev to tconn
introduced a new idr_for_each loop over all volumes,
but did not take necessary rcu locks or krefs.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Wait until IO is drained in all volumes.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
This is necessary since the transfer_log on the sending is also
per tconn.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
An epoch object needs a pointer to the mdev it was received for.
This is necessary to be able to send the barrier ack packet for
the same volume as the original barrier packet was assigned to.
This prepares the next step, in which the (receiver side)
epoch list is moved from the device (mdev) to the connection (tconn)
object.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
This is necessary in order to prepare the move of the (receiver side)
epoch list from the device (mdev) to the connection (tconn) objects.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd-8.3:
drbd: O_SYNC gives EIO on ramdisks for some kernels (eg. RHEL6).
drbd: send intermediate state change results to the peer
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Since version 4.6.1 gcc warns about variables that get
a value assigned, but which are never read later on.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
With sync-after dependencies, given "lucky" timing of pause/unpause
events, and the end of an empty (0 bits set) resync was sometimes not
detected on the SyncTarget, leading to a "stalled" SyncSource state.
Fixed this by expecting not only "Inconsistent -> UpToDate" but also
"Consistent -> UpToDate" transitions for the peer disk state
to end a resync.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>