Commit Graph

567 Commits

Author SHA1 Message Date
Vaidas Gasiunas 092b5cee4b MVC2.0: Rollback added code 2022-02-14 13:50:42 -08:00
Lukas Joswiak d5a562e6b8 Fix dynamic knobs correctness issues 2022-02-09 13:43:32 -08:00
Ata E Husain Bohra f3c3ab06f1 Add new FDB EncryptKeyProxy role
diff-1: Address review comments

Major changes includes:

1. Add a new FDB role responsible- EncyrptKeyProxy. The role is
   responsible to expose APIs to fetch encyrption keys interacting
   with external Encryption KeyManager interface.
2. The process is a FDB singleton process following similar recruitment
   rules as other singleton processes in the system.
3. Code to recruit the worker process; given the encryption keys are
   needed during recovery (decode TLog records), for now the process
   is co-located in same datacenter as ClusterController.
4. Skeleton process actor code; more functionality will be added in
   subsequent PRs.

NOTE: The code is protected under a SERVER_KNOB with the default
      value as 'false' for now.:%s
2022-01-25 23:12:49 -08:00
Ata E Husain Bohra 87ee4cf958 Add new FDB EncryptKeyProxy role
Major changes includes:

1. Add a new FDB role responsible- EncyrptKeyProxy. The role is
   responsible to expose APIs to fetch encyrption keys interacting
   with external Encryption KeyManager interface.
2. The process is a FDB singleton process following similar recruitment
   rules as other singleton processes in the system.
3. Code to recruit the worker process; given the encryption keys are
   needed during recovery (decode TLog records), for now the process
   is co-located in same datacenter as ClusterController.
4. Skeleton process actor code; more functionality will be added in
   subsequent PRs.

NOTE: The code is protected under a SERVER_KNOB with the default
      value as 'false' for now.
2022-01-25 17:38:27 -08:00
Ata E Husain Bohra 703364d146
Update cluster recovery documentation (#6255)
Patch updates code documentation to reflect the recent code
refactoring where ClusterController process drives recovery
instead of sequencer/master process.
2022-01-18 13:54:00 -08:00
Ata E Husain Bohra 936bf5336a
Revert "Revert "Refactor: ClusterController driving cluster-recovery state machine" (#6191)
* Revert "Revert "Refactor: ClusterController driving cluster-recovery state machine""

Major changes includes:
1. Re-revert Sequencer refactor commits listed below (in listed order):
1.a. This reverts commit bb17e194d9.
1.b. This reverts commit d174bb2e06.
1.c. This reverts commit 30b05b469c.

2. Update Status.actor to track ClusterController interface to track
   recovery status.
3. Introduce a ServerKnob to define "cluster recovery trace event"
   prefix; for now keeping it as "Master", however, it should allow
   smooth transition to "Cluster" prefix as it seems more appropriate.
2022-01-06 12:15:51 -08:00
Aaron Molitor 30b05b469c Revert "Refactor: ClusterController driving cluster-recovery state machine"
This reverts commit dfe9d184ff.
2021-12-24 11:25:51 -08:00
Aaron Molitor d174bb2e06 Revert "Refactor: ClusterController driving cluster-recovery state machine"
This reverts commit abd2959702.
2021-12-24 11:25:51 -08:00
Aaron Molitor bb17e194d9 Revert "Refactor: ClusterController driving cluster-recovery state machine"
This reverts commit 1520390bc5.
2021-12-24 11:25:51 -08:00
Ata E Husain Bohra 1520390bc5 Refactor: ClusterController driving cluster-recovery state machine
diff-1: Address Jingyu's review comments
 diff-2: Introduce ClusterRecovery actor to seperate out
         cluster recovery code

At present, cluster recovery process consists of following steps:
1. ClusterController clusterWatchDatabase actor recruits
   master/sequencer process.
2. Sequencer process implements the cluster recovery state machine,
   responsible to recruit all other processes as well restore the
   cluster state.

Patch proposes a scheme where the cluster recovery state machine
is implemented and driven by the ClusterController process instead
of the Sequencer process.

Advantages of the scheme could be:
1. Simplified design where ClusterController recruits "sequencer"
   process like other worker processes compared to current scheme
   where "sequencer" process gets special treatment. In newer scheme
   sequencer is responsible for maintaining/providing
   "committed version" (as expected).
2. ClusterController is responsible for worker processes recruitment,
   the sequencer though orchestrating the recovery state machine, it
   need to reachout to the ClusterController for recruiting worker
   processes etc.

NOTE:
Patch has moved the recovery state machine code from
'sequencer' -> 'cluster-controller' process, however, necessary
updates were done for both functionality as well as performance
improvement reasons.

Next Steps:
Cluster recovery documentation will be updated in near future.
2021-12-22 14:06:27 -08:00
Ata E Husain Bohra abd2959702 Refactor: ClusterController driving cluster-recovery state machine
diff-1: Address Jingyu's review comments

At present, cluster recovery process consists of following steps:
1. ClusterController clusterWatchDatabase actor recruits
   master/sequencer process.
2. Sequencer process implements the cluster recovery state machine,
   responsible to recruit all other processes as well restore the
   cluster state.

Patch proposes a scheme where the cluster recovery state machine
is implemented and driven by the ClusterController process instead
of the Sequencer process.

Advantages of the scheme could be:
1. Simplified design where ClusterController recruits "sequencer"
   process like other worker processes compared to current scheme
   where "sequencer" process gets special treatment. In newer scheme
   sequencer is responsible for maintaining/providing
   "committed version" (as expected).
2. ClusterController is responsible for worker processes recruitment,
   the sequencer though orchestrating the recovery state machine, it
   need to reachout to the ClusterController for recruiting worker
   processes etc.

NOTE:
Patch has moved the recovery state machine code from
'sequencer' -> 'cluster-controller' process, however, necessary
updates were done for both functionality as well as performance
improvement reasons.

Next Steps:
Cluster recovery documentation will be updated in near future.
2021-12-22 14:06:27 -08:00
Ata E Husain Bohra dfe9d184ff Refactor: ClusterController driving cluster-recovery state machine
At present, cluster recovery process consists of following steps:
1. ClusterController clusterWatchDatabase actor recruits
   master/sequencer process.
2. Sequencer process implements the cluster recovery state machine,
   responsible to recruit all other processes as well restore the
   cluster state.

Patch proposes a scheme where the cluster recovery state machine
is implemented and driven by the ClusterController process instead
of the Sequencer process.

Advantages of the scheme could be:
1. Simplified design where ClusterController recruits "sequencer"
   process like other worker processes compared to current scheme
   where "sequencer" process gets special treatment. In newer scheme
   sequencer is responsible for maintaining/providing
   "committed version" (as expected).
2. ClusterController is responsible for worker processes recruitment,
   the sequencer though orchestrating the recovery state machine, it
   need to reachout to the ClusterController for recruiting worker
   processes etc.

NOTE:
Patch has moved the recovery state machine code from
'sequencer' -> 'cluster-controller' process, however, necessary
updates were done for both functionality as well as performance
improvement reasons.

Next Steps:
Cluster recovery documentation will be updated in near future.
2021-12-22 14:06:27 -08:00
Andrew Noyes def41697bf
Merge pull request #6083 from sfc-gh-tclinkenbeard/remove-temporaries
Avoid creating unnecessary temporary objects
2021-12-06 13:24:56 -08:00
Zhe Wu dc88d3fa37 use CC_ENABLE_WORKER_HEALTH_MONITOR knob to guard remoteDCIsHealthy logic 2021-12-06 09:33:45 -08:00
sfc-gh-tclinkenbeard d01a363e29 Avoid creating unnecessary temporary objects 2021-12-01 23:48:34 -08:00
Evan Tschannen 557186ed17
Merge pull request #5909 from sfc-gh-jfu/jfu-cc-request-dbinfo
Change dbinfo broadcast to be explicitly requested by the worker registration message
2021-11-16 15:01:42 -08:00
Evan Tschannen 964d0209ca
Merge pull request #5637 from sfc-gh-ljoswiak/features/data-loss-prevention
Data loss protection when joining new cluster
2021-11-15 15:26:32 -08:00
Vaidas Gasiunas 51b8ccf7d3 Merge remote-tracking branch 'apple/master' into notify-client-lib-changes 2021-11-10 18:40:34 +01:00
Lukas Joswiak 74cf64fe0f Sync cluster ID through ServerDBInfo 2021-11-09 12:29:48 -08:00
Lukas Joswiak 3e2c65bb11 Allow tlog to join another cluster but retain its data 2021-11-09 12:29:48 -08:00
Lukas Joswiak 30867750b5 Add protection against storage and tlog data deletion when joining a new cluster 2021-11-09 12:29:47 -08:00
Jon Fu 00f4bd8536 Check ccInterface against serverDbInfo's cc and make broadcast unconditional for first registration 2021-11-08 12:43:02 -05:00
Jon Fu 4e8625ccc0 retain old behaviour along with explicit request 2021-11-03 17:23:07 -04:00
Jon Fu 59f0a2c3e5 Change dbinfo broadcast to be explicitly requested by the worker registration message 2021-11-03 15:51:21 -04:00
Evan Tschannen ee00135a6b skip good recruitment errors when doing simulation only validation 2021-11-01 13:24:15 -07:00
Evan Tschannen 78e36e7590 fix: simulation only validation could throw errors which would impact the behavior of the cluster controller 2021-11-01 13:24:15 -07:00
Evan Tschannen ddf235713e strengthen assert 2021-10-28 16:40:30 -07:00
Evan Tschannen 4d8ee2ed33 fix: simple recruitment could succeed with less than the required replication factor 2021-10-28 16:38:04 -07:00
Vaidas Gasiunas 875824b186 MVC2.0: Notify clients about relevant changes of client libraries 2021-10-27 23:43:40 +02:00
Josh Slocum 0ff8ddc2b6 Merge branch 'master' into blob_full_clean 2021-10-25 13:38:48 -05:00
A.J. Beamon e882eb33fc Abstract the cluster file into a cluster connection record that can be backed by something other than the filesystem. 2021-10-22 11:05:18 -07:00
Josh Slocum 773886515e Merge branch 'feature-range-feed' into blob_full_clean 2021-10-22 11:07:51 -05:00
Josh Slocum 912ef76f1c cleanup before merge 2021-10-18 17:11:14 -05:00
Suraj Gupta 5466bdb569 Gate more entry points to BM recruitment. 2021-10-18 15:04:22 -04:00
A.J. Beamon 507a09893c
Add ClientCount to ClusterControllerMetrics (#5748) 2021-10-17 20:47:11 -07:00
Josh Slocum 5f0ec0612a Merge branch 'feature-range-feed' into blob_full 2021-10-13 15:44:35 -05:00
Suraj Gupta 2ec8781224 Merge knobs into one. 2021-10-13 14:00:37 -04:00
Suraj Gupta 5a6a052c55 Add a knob to gate blob-related work. 2021-10-13 09:48:02 -04:00
Zhe Wu 645cfc85a0 fix remote health variables declaration order 2021-10-07 21:54:25 -07:00
Zhe Wu 6540b6eec5 Some improvements for grey failure failover 2021-10-07 20:42:55 -07:00
Zhe Wu c07a07dbbe Take uptime into account when making failover decision 2021-10-07 11:19:34 -07:00
Zhe Wu 62197faa46 Add more comments to the code 2021-10-07 11:19:34 -07:00
Zhe Wu c0fbe5471f Implement the core logic of grey failure triggered failover 2021-10-07 11:19:34 -07:00
Suraj Gupta 282f9d35cd Cleanup comments and debugging code. 2021-10-04 11:07:08 -04:00
Suraj Gupta 4d54669ccd Recruit the blob workers via blob manager.
In this PR, the blob manager now recruits blob workers
(via communication with the cluster controller). Blob workers
are onboarded as blob worker processes enter the cluster.
2021-10-04 11:07:08 -04:00
Chang Liu c523964ff7 Fix roll trace event issue
Description

Testing
2021-09-24 09:53:32 -07:00
Chang Liu 8427e40cbe Fix roll trace event issue
Description

Testing
2021-09-24 09:53:32 -07:00
Chang Liu 48990058a3 Fix roll trace event issue
Description

Testing
2021-09-24 09:53:32 -07:00
Zhe Wu e28fef6264 Fix failover logic in checkRecoveryStalled: failover only when remote is enabled 2021-09-23 20:12:22 -07:00
Suraj Gupta 5fa6c687d6 Add blob manager as a singleton. 2021-09-23 10:45:37 -04:00