567 Commits

Author SHA1 Message Date
Renxuan Wang 2a59c5fd4e
Workers should monitor coordinators in submitCandidacy(). (#6655)
* Workers should monitor coordinators in submitCandidacy().

* Change re-resolve delay to a knob.
2022-03-24 19:20:42 -07:00
Josh Slocum f27475e2f4 Merge branch 'main' into blob_integration 2022-03-22 11:41:58 -05:00
sfc-gh-tclinkenbeard a71099471b Update copyright header dates 2022-03-21 13:36:23 -07:00
Josh Slocum 37e7c80f26 Merge branch 'main' into blob_integration 2022-03-17 18:45:42 -05:00
Josh Slocum 0f9e88572a Cleaning up debugging and fixing race in blob manager recruitment 2022-03-17 14:57:43 -05:00
Tao Lin e2c7c30faf
GetMappedRange support serializable & check RYW & continuation (#6181) 2022-03-10 10:05:44 -08:00
Josh Slocum b21d0943b9 client-focused cleanup 2022-03-09 10:01:25 -06:00
Josh Slocum e71b3533f9 Merge branch 'main' into blob_integration 2022-03-09 08:59:56 -06:00
A.J. Beamon 250a88e682 Enforce that trace event suppression calls happen first when using trace event call chaining. Fix various instances where we weren't following this requirement. 2022-02-24 12:25:52 -08:00
Renxuan Wang 3c1394578b Address comments. 2022-02-22 16:29:59 -08:00
Renxuan Wang 622d89b552 Rebase on main.
Since we changed ClusterConnectionString's status flag from boolean to enum in #6422, we need to update this PR correspondingly.
2022-02-22 16:29:59 -08:00
Renxuan Wang 8eb7a10404 Address comments. 2022-02-22 16:29:59 -08:00
Renxuan Wang 481587a8c6 Turn on hostname logic. 2022-02-22 16:29:59 -08:00
Josh Slocum 38a75a8b89 Merge branch 'main' into blob_integration 2022-02-17 17:47:38 -06:00
Lukas Joswiak d5a562e6b8 Fix dynamic knobs correctness issues 2022-02-09 13:43:32 -08:00
Yi Wu cda68a0e4d Support xxhash3 for checksumming DiskQueue for TLogs 2022-02-07 13:32:52 -08:00
Renxuan Wang f9f3735f73 Add resolveHostnamesBlocking() in ConnectionString and IClusterConnectionRecord.
Also, combine IClusterConnectionRecord::getConnectionString() and IClusterConnectionRecord::getMutableConnectionString() to IClusterConnectionRecord::getConnectionString(), and rename setConnectionString() to setAndPersistConnectionString().
2022-01-28 12:20:41 -08:00
Ata E Husain Bohra 87ee4cf958 Add new FDB EncryptKeyProxy role
Major changes include:

1. Add a new FDB role, EncryptKeyProxy. The role is
   responsible for exposing APIs to fetch encryption keys by interacting
   with an external Encryption KeyManager interface.
2. The process is an FDB singleton process following similar recruitment
   rules as other singleton processes in the system.
3. Code to recruit the worker process; given that the encryption keys are
   needed during recovery (to decode TLog records), for now the process
   is co-located in the same datacenter as the ClusterController.
4. Skeleton process actor code; more functionality will be added in
   subsequent PRs.

NOTE: The code is protected under a SERVER_KNOB with the default
      value as 'false' for now.
2022-01-25 17:38:27 -08:00
Josh Slocum 42a36dc756 Fixed Blob Manager recruitment error and Blob Worker monitoring error 2022-01-24 09:46:37 -06:00
Josh Slocum 6b202fc9a8 Fixed stuck change feed storage updater and improved debugging 2022-01-11 15:35:54 -06:00
A.J. Beamon b44ebe0c65 Fix typo in trace event name 2022-01-11 13:22:00 -08:00
Ata E Husain Bohra 936bf5336a
Revert "Revert "Refactor: ClusterController driving cluster-recovery state machine"" (#6191)
* Revert "Revert "Refactor: ClusterController driving cluster-recovery state machine""

Major changes include:
1. Re-revert Sequencer refactor commits listed below (in listed order):
1.a. This reverts commit bb17e194d9.
1.b. This reverts commit d174bb2e06.
1.c. This reverts commit 30b05b469c.

2. Update Status.actor to use the ClusterController interface to track
   recovery status.
3. Introduce a ServerKnob to define the "cluster recovery trace event"
   prefix; for now it is kept as "Master", but it should allow a
   smooth transition to the "Cluster" prefix, which seems more appropriate.
2022-01-06 12:15:51 -08:00
Aaron Molitor 30b05b469c Revert "Refactor: ClusterController driving cluster-recovery state machine"
This reverts commit dfe9d184ff.
2021-12-24 11:25:51 -08:00
Ata E Husain Bohra dfe9d184ff Refactor: ClusterController driving cluster-recovery state machine
At present, the cluster recovery process consists of the following steps:
1. The ClusterController clusterWatchDatabase actor recruits the
   master/sequencer process.
2. The Sequencer process implements the cluster recovery state machine,
   responsible for recruiting all other processes as well as restoring
   the cluster state.

This patch proposes a scheme where the cluster recovery state machine
is implemented and driven by the ClusterController process instead
of the Sequencer process.

Advantages of the scheme could be:
1. Simplified design where the ClusterController recruits the "sequencer"
   process like other worker processes, compared to the current scheme
   where the "sequencer" process gets special treatment. In the newer
   scheme the sequencer is responsible for maintaining/providing
   the "committed version" (as expected).
2. The ClusterController is responsible for worker process recruitment;
   the sequencer, though orchestrating the recovery state machine,
   needs to reach out to the ClusterController to recruit worker
   processes, etc.

NOTE:
The patch moves the recovery state machine code from the
'sequencer' process to the 'cluster-controller' process; necessary
updates were also made for both functionality and performance
reasons.

Next Steps:
Cluster recovery documentation will be updated in the near future.
2021-12-22 14:06:27 -08:00
Evan Tschannen f2838740f1 fix: do not allow more than one blob worker per address 2021-12-03 10:29:22 -08:00
negoyal 2725183b26 Don't include an unreliable process in the protected list. 2021-12-03 09:44:52 -08:00
Steve Atherton 3caca74ac2 Merge commit 'fd707c6d7ee80de6d9fda5796da2d0add10abd79' into bit-flipping-workload 2021-11-16 21:54:27 -08:00
Evan Tschannen 557186ed17
Merge pull request #5909 from sfc-gh-jfu/jfu-cc-request-dbinfo
Change dbinfo broadcast to be explicitly requested by the worker registration message
2021-11-16 15:01:42 -08:00
Steve Atherton 035e0d6e52
Merge branch 'master' into bit-flipping-workload 2021-11-16 14:42:22 -08:00
Steve Atherton c53f5aa110 Renamed redwood to redwood-1-experimental and file extension to .redwood-v1. 2021-11-16 02:15:22 -08:00
Evan Tschannen 964d0209ca
Merge pull request #5637 from sfc-gh-ljoswiak/features/data-loss-prevention
Data loss protection when joining new cluster
2021-11-15 15:26:32 -08:00
Tao Lin fdb3b72e35 Introduce GetRangeAndFlatMap to push computations down to FDB
Re-introduce #5609
2021-11-09 13:52:28 -08:00
Lukas Joswiak 3988b11fd6 Cleanup 2021-11-09 12:29:48 -08:00
Lukas Joswiak 30867750b5 Add protection against storage and tlog data deletion when joining a new cluster 2021-11-09 12:29:47 -08:00
Jon Fu 2887e1c30a set flag to true when doing first registration 2021-11-09 12:44:07 -05:00
Jon Fu 00f4bd8536 Check ccInterface against serverDbInfo's cc and make broadcast unconditional for first registration 2021-11-08 12:43:02 -05:00
Tao Lin 586cc3b102
Revert "Introduce GetRangeAndFlatMap to push computations down to FDB" 2021-11-04 08:46:56 -07:00
Tao Lin 6c98e35893 Rename Hop to FlatMap 2021-11-03 13:32:01 -07:00
Tao Lin 0853661d13 Introduce getRangeAndHop to push computations down to FDB 2021-11-03 13:21:16 -07:00
Jon Fu 59f0a2c3e5 Change dbinfo broadcast to be explicitly requested by the worker registration message 2021-11-03 15:51:21 -04:00
negoyal 1e7338b6c3 Merge branch 'master' into bit-flipping-workload 2021-10-28 14:24:49 -07:00
Josh Slocum 0ff8ddc2b6 Merge branch 'master' into blob_full_clean 2021-10-25 13:38:48 -05:00
A.J. Beamon e882eb33fc Abstract the cluster file into a cluster connection record that can be backed by something other than the filesystem. 2021-10-22 11:05:18 -07:00
Josh Slocum 5f0ec0612a Merge branch 'feature-range-feed' into blob_full 2021-10-13 15:44:35 -05:00
negoyal f913dfed97 Merge branch 'master' into bit-flipping-workload 2021-10-11 16:34:57 -07:00
Zhe Wu 784d899afb fix assignment error in addressInDbAndRemoteDc unittest 2021-10-07 16:17:09 -07:00
Zhe Wu c07a07dbbe Take uptime into account when making failover decision 2021-10-07 11:19:34 -07:00
Zhe Wu 62197faa46 Add more comments to the code 2021-10-07 11:19:34 -07:00
Zhe Wu c0fbe5471f Implement the core logic of grey failure triggered failover 2021-10-07 11:19:34 -07:00
Suraj Gupta 282f9d35cd Cleanup comments and debugging code. 2021-10-04 11:07:08 -04:00