Commit Graph

83 Commits

Author SHA1 Message Date
Lukas Joswiak 795b666e23 Fix a rare configuration database data loss bug
See the comment contained in this commit. This bug could only manifest
under a specific set of circumstances:

1. A coordinator change is started
2. The coordinator change succeeds, but its action of clearing
   `previousCoordinatorsKey` is delayed.
3. A minority of `ConfigNode`s have an old state of the configuration
   database, compared to the majority.
4. A `ConfigNode` in the majority dies and permanently loses data.
5. A long delay occurs on the `PaxosConfigConsumer` when it tries to
   read the latest changes from the `ConfigNode`s.

In the above circumstances, the `ConfigBroadcaster` could incorrectly
send a snapshot of an old state of the configuration database to a
majority of `ConfigNode`s. This would cause new, durable, and
acknowledged commit data to be overwritten.

Note that this bug only affects the configuration database (used for
knob storage). It does not affect the normal keyspace.
2022-11-22 11:20:04 -08:00
sfc-gh-tclinkenbeard 74212eeacf Encapsulate CounterCollection 2022-10-25 10:17:15 -07:00
Lukas Joswiak 8c50f98c00 Update type of coordinators hash
This fixes some serialization issues caused by `BinaryReader` being
unable to deserialize values of type `size_t`.
2022-09-13 16:53:54 -07:00
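The commit above does not say which type replaced `size_t`. As a minimal sketch, assuming a fixed-width `std::uint64_t` was the goal, the following shows why a fixed-width type serializes more predictably than `size_t`, whose width and underlying type vary by platform. The helper names `coordinatorsHash`, `serializeHash`, and `deserializeHash` are hypothetical and are not FoundationDB APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper: hash a coordinator list and store the result in a
// fixed-width type rather than the platform-dependent std::size_t.
std::uint64_t coordinatorsHash(const std::string& coordinators) {
    std::size_t h = std::hash<std::string>{}(coordinators);
    return static_cast<std::uint64_t>(h);
}

// Write the hash as exactly 8 little-endian bytes, independent of platform.
std::vector<std::uint8_t> serializeHash(std::uint64_t hash) {
    std::vector<std::uint8_t> out(8);
    for (int i = 0; i < 8; ++i) {
        out[i] = static_cast<std::uint8_t>((hash >> (8 * i)) & 0xff);
    }
    return out;
}

// Read the 8 bytes back into a std::uint64_t.
std::uint64_t deserializeHash(const std::vector<std::uint8_t>& bytes) {
    assert(bytes.size() == 8);
    std::uint64_t hash = 0;
    for (int i = 0; i < 8; ++i) {
        hash |= static_cast<std::uint64_t>(bytes[i]) << (8 * i);
    }
    return hash;
}
```

Because the wire format is always 8 bytes, a reader and a writer agree on the layout even if they were built for platforms where `size_t` differs.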
Lukas Joswiak 7ee6be9238 Simplify how ConfigBroadcastInterface is stored on worker 2022-09-13 16:53:54 -07:00
Lukas Joswiak 809d77c2ab Fix issue where annotations were not being serialized 2022-09-13 16:53:54 -07:00
Lukas Joswiak b2d395a304 Delay cluster controller restart when pushing knob updates to workers
This gives the `ConfigBroadcaster` time to send the knob change to all
workers before applying the change to itself and restarting.
2022-09-13 16:53:54 -07:00
Lukas Joswiak 8d237ba493 Fix various correctness and timeout issues
Contains the following fixes:

* When handling the special case rollforward where nodes can be rolled
  forward even if a majority are at version 0, we don't want to reset
  the live version of the node being rolled forward. This is because a
  quorum of nodes at version 0 can continue handing out and incrementing
  their live version, and if they are rolled forward there is the
  potential for them to go back in time in regard to their live version.
  So in this one special case, they should maintain their existing live
  version.
* Fixes some unseed issues due to fields not being initialized properly.
* Temporarily disables a coordinator restart in the recovery path (in
  the coordinated state) due to it causing a timeout. This needs more
  investigation in the future.
2022-09-13 16:53:54 -07:00
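The first fix in the commit above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `ConfigNode` code: the struct `NodeVersions`, the function `rollforward`, and the flag `majorityAtVersionZero` are hypothetical names, and the sketch assumes that an ordinary rollforward resets the live version to the target version.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-node version state.
struct NodeVersions {
    std::int64_t committedVersion = 0;
    std::int64_t liveVersion = 0;
};

// Roll a node forward to `target`. In the special case where a majority of
// nodes are still at version 0, the node keeps its existing live version so
// that it never hands out a live version it has already used.
void rollforward(NodeVersions& node, std::int64_t target, bool majorityAtVersionZero) {
    assert(target >= node.committedVersion);
    node.committedVersion = target;
    if (!majorityAtVersionZero) {
        // Assumed normal path: the live version is reset to the target.
        node.liveVersion = target;
    }
    // Special case: liveVersion is intentionally left untouched.
}
```

For example, a node with live version 7 that is rolled forward to committed version 5 in the special case keeps live version 7, so its live version never moves backwards.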
Lukas Joswiak 249ff2b2fd Fix configuration database unit tests 2022-09-13 16:53:54 -07:00
Lukas Joswiak cd2bbffa4c Add flag to disable the configuration database
The `--no-config-db` flag, passed to `fdbserver`, will disable the
configuration database. When this flag is specified, no `ConfigNode`s
will be started, the `ConfigBroadcaster` will not be started, and on a
coordinator change no attempt will be made to lock `ConfigNode`s.
2022-09-13 16:53:54 -07:00
Lukas Joswiak 74ac617a34 Add support for changing coordinators to the configuration database
Configuration database data lives on the coordinators. When a change
coordinators command is issued, the data must be sent to the new
coordinators to keep the database consistent.
2022-09-13 16:53:54 -07:00
Lukas Joswiak 9ca8a3c683 Reenable status json for dynamic knobs, add unit test 2022-06-21 11:43:05 -07:00
sfc-gh-tclinkenbeard a71099471b Update copyright header dates 2022-03-21 13:36:23 -07:00
Lukas Joswiak 582ba5d519 Fix issue with stuck config nodes
In rare cases, when the cluster controller dies or moves to a new
machine, only a minority of `ConfigNode`s receive messages telling them
they are registered. When the `ConfigNode`s then attempt to register
with the new broadcaster (on the new cluster controller), the knob
system gets stuck because only a minority are registered. Part of this
change allows registration of unregistered `ConfigNode`s if there is no
path to a majority of registered nodes.
2022-03-15 11:42:58 -07:00
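One way to read "no path to a majority of registered nodes" in the commit above is that the registered nodes alone can never form a quorum. The sketch below encodes that interpretation; it is an assumption about the policy, not the broadcaster's real logic, and `majoritySize` and `shouldAcceptUnregisteredNode` are hypothetical names.

```cpp
#include <cassert>

// Smallest number of nodes that forms a majority of `total`.
int majoritySize(int total) {
    return total / 2 + 1;
}

// Hypothetical policy: accept an unregistered node only when the nodes that
// believe they are registered are too few to ever reach a majority.
bool shouldAcceptUnregisteredNode(int totalNodes, int registeredNodes) {
    assert(totalNodes > 0);
    assert(registeredNodes >= 0 && registeredNodes <= totalNodes);
    return registeredNodes < majoritySize(totalNodes);
}
```

With five `ConfigNode`s, two registered nodes cannot reach the majority of three, so under this reading the broadcaster would admit unregistered nodes rather than wait indefinitely.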
Lukas Joswiak a8828db58e Load balance dynamic knob requests
This commit also removes an attempt to read the latest configuration
snapshot when a rollforward timeout occurs. The normal retry loop will
eventually fetch an up to date snapshot and the rollforward will be
retried.
2022-02-22 10:53:48 -08:00
Lukas Joswiak f300cec6ed Fast-track ConfigNode registration with Simple DB
When using the `ConfigDBType::Simple` configuration database, allow
nodes to immediately register with the broadcaster without having to
wait for a quorum.
2022-02-09 14:18:48 -08:00
Lukas Joswiak b5a3312a26 Factor out known replica update step 2022-02-09 13:43:33 -08:00
Lukas Joswiak 1d15aa5580 Fix internal function name 2022-02-09 13:43:32 -08:00
Lukas Joswiak d5a562e6b8 Fix dynamic knobs correctness issues 2022-02-09 13:43:32 -08:00
Lukas Joswiak 7e6bc27863 Remove linear time loop 2021-08-23 14:02:41 -07:00
Lukas Joswiak 08892eab55 Move client failure cleanup 2021-08-23 12:54:03 -07:00
Lukas Joswiak adc1025fa1 Clean up clientFailures periodically 2021-08-23 12:45:42 -07:00
Lukas Joswiak d004703cc8 Add worker kill unit test 2021-08-23 12:45:42 -07:00
sfc-gh-tclinkenbeard b6c669be23 Send ConfigBroadcastSnapshotReply to broadcaster 2021-08-19 14:45:30 -07:00
sfc-gh-tclinkenbeard 62303af832 Remove invalid assertion from ConfigBroadcastSnapshotRequest handling 2021-08-18 13:24:00 -07:00
sfc-gh-tclinkenbeard 0bacc310ef Reenable consumer in config broadcaster 2021-08-17 12:09:12 -07:00
sfc-gh-tclinkenbeard 616a01d01d Only register each worker once with config broadcaster (consumer currently disabled) 2021-08-17 11:45:50 -07:00
sfc-gh-tclinkenbeard 3418c20867 Merge remote-tracking branch 'origin/master' into paxos-config-db 2021-08-16 10:49:47 -07:00
Lukas Joswiak 1faec36bc6 Wait for all snapshot replies before sending incremental changes 2021-08-11 11:17:51 -07:00
Lukas Joswiak c098a1128d Push snapshot changes to local configuration on refresh 2021-08-11 09:13:22 -07:00
Lukas Joswiak b112560c94 Reorder registerWorker to prevent potential conflict 2021-08-10 15:09:35 -07:00
Lukas Joswiak f018af6ee4 Update fdbserver/ConfigBroadcaster.actor.cpp
Co-authored-by: Trevor Clinkenbeard <trevor.clinkenbeard@snowflake.com>
2021-08-10 13:24:41 -07:00
Lukas Joswiak d27c9e2520 Revert error check 2021-08-10 12:41:41 -07:00
Lukas Joswiak a838a47b0b Use ActorCollection for consumer future 2021-08-10 12:27:19 -07:00
Lukas Joswiak 598b23f8d4 Merge branch 'features/broadcaster-push' of github.com:sfc-gh-ljoswiak/foundationdb into features/broadcaster-push 2021-08-10 12:08:16 -07:00
Lukas Joswiak 5dfd7c4b1a Remove redundant dead worker check 2021-08-10 11:56:58 -07:00
Lukas Joswiak cf81b0650d Only register consumer once on the broadcaster 2021-08-10 11:56:16 -07:00
Lukas Joswiak 72e55ef72e Add broadcaster error check to unit tests 2021-08-10 11:39:29 -07:00
Lukas Joswiak 564a3d69b7 Rename config broadcast interface messages 2021-08-10 11:39:29 -07:00
Lukas Joswiak 85fa264a16 Remove move constructor and assignment operator 2021-08-10 11:39:29 -07:00
Lukas Joswiak 305a17c811 Improve config broadcaster logic, fix unit tests 2021-08-10 11:39:29 -07:00
Lukas Joswiak 72e63db856 Send ConfigBroadcastInterface to ConfigBroadcaster instead of entire worker interface 2021-08-10 11:39:29 -07:00
Lukas Joswiak 3946cf94ff Push updates to workers (clang-formatted files) 2021-08-10 11:39:29 -07:00
Lukas Joswiak 092ab4302b Push updates to workers 2021-08-10 11:39:29 -07:00
Lukas Joswiak 3a607d9a38 Update fdbserver/ConfigBroadcaster.actor.cpp
Co-authored-by: Trevor Clinkenbeard <trevor.clinkenbeard@snowflake.com>
2021-08-10 09:36:39 -07:00
Lukas Joswiak c97a1b3b4d Remove move constructor and assignment operator 2021-08-09 15:33:01 -07:00
Lukas Joswiak 5249105b04 Improve config broadcaster logic, fix unit tests 2021-08-09 13:20:06 -07:00
sfc-gh-tclinkenbeard 82546853c0 Rename UseConfigDB to ConfigDBType 2021-08-09 10:04:35 -07:00
sfc-gh-tclinkenbeard b15daf1886 Added PImpl class
This class propagates the constness of methods to their pimpl
implementations.
2021-08-09 10:04:34 -07:00
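The idea of a const-propagating pimpl holder can be sketched as below. A plain `std::unique_ptr<Impl>` member lets a `const` method call non-const methods on the implementation, because constness of the pointer does not extend to the pointee. The wrapper here closes that gap. The names `ConstPImpl`, `CounterImpl`, and `Counter` are hypothetical and are not taken from the commit.

```cpp
#include <cassert>
#include <memory>

// Hypothetical wrapper: owns the implementation and hands out a const
// pointer or reference whenever it is accessed through a const path.
template <class T>
class ConstPImpl {
    std::unique_ptr<T> impl;

public:
    ConstPImpl() : impl(std::make_unique<T>()) {}

    T* operator->() { return impl.get(); }
    const T* operator->() const { return impl.get(); }
    T& operator*() { return *impl; }
    const T& operator*() const { return *impl; }
};

// Implementation type, normally hidden in a .cpp file.
struct CounterImpl {
    int value = 0;
    void increment() { ++value; }
    int get() const { return value; }
};

// Public class that forwards to the implementation.
class Counter {
    ConstPImpl<CounterImpl> impl;

public:
    void increment() { impl->increment(); }
    // Inside this const method, impl-> yields a const CounterImpl*, so
    // calling impl->increment() here would fail to compile.
    int get() const { return impl->get(); }
};
```

The design choice is to pay for two overloads of each accessor so that the compiler, rather than code review, enforces that `const` methods of the outer class do not mutate the hidden state.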
Lukas Joswiak fae29dbb1f Send ConfigBroadcastInterface to ConfigBroadcaster instead of entire worker interface 2021-08-06 12:42:07 -07:00
Lukas Joswiak 38d05a2f49 Push updates to workers (clang-formatted files) 2021-08-05 18:57:12 -07:00