Commit Graph

587 Commits

Author SHA1 Message Date
Josh Slocum 4b254d259c Ensuring BM split retry is idempotent 2022-03-10 11:54:57 -06:00
Josh Slocum e71b3533f9 Merge branch 'main' into blob_integration 2022-03-09 08:59:56 -06:00
A.J. Beamon 5fa9d3e1b7 Add a tenant parameter to read and commit requests. Store a map of all tenants on commit proxy and storage servers. Add an option to require tenant mode. 2022-03-06 21:54:21 -08:00
Josh Slocum 623db663dc don't reset watch config transaction 2022-02-25 08:48:52 -06:00
Renxuan Wang 06b1d06d38 Support hostname in coordinators commands. 2022-02-24 23:02:29 -08:00
A.J. Beamon 250a88e682 Enforce that trace event suppression calls happen first when using trace event call chaining. Fix various instances where we weren't following this requirement. 2022-02-24 12:25:52 -08:00
Josh Slocum 38a75a8b89 Merge branch 'main' into blob_integration 2022-02-17 17:47:38 -06:00
Vaidas Gasiunas 092b5cee4b MVC2.0: Rollback added code 2022-02-14 13:50:42 -08:00
Lukas Joswiak d5a562e6b8 Fix dynamic knobs correctness issues 2022-02-09 13:43:32 -08:00
Ata E Husain Bohra f3c3ab06f1 Add new FDB EncryptKeyProxy role
diff-1: Address review comments

Major changes includes:

1. Add a new FDB role responsible- EncyrptKeyProxy. The role is
   responsible to expose APIs to fetch encyrption keys interacting
   with external Encryption KeyManager interface.
2. The process is a FDB singleton process following similar recruitment
   rules as other singleton processes in the system.
3. Code to recruit the worker process; given the encryption keys are
   needed during recovery (decode TLog records), for now the process
   is co-located in same datacenter as ClusterController.
4. Skeleton process actor code; more functionality will be added in
   subsequent PRs.

NOTE: The code is protected under a SERVER_KNOB with the default
      value as 'false' for now.:%s
2022-01-25 23:12:49 -08:00
Ata E Husain Bohra 87ee4cf958 Add new FDB EncryptKeyProxy role
Major changes includes:

1. Add a new FDB role responsible- EncyrptKeyProxy. The role is
   responsible to expose APIs to fetch encyrption keys interacting
   with external Encryption KeyManager interface.
2. The process is a FDB singleton process following similar recruitment
   rules as other singleton processes in the system.
3. Code to recruit the worker process; given the encryption keys are
   needed during recovery (decode TLog records), for now the process
   is co-located in same datacenter as ClusterController.
4. Skeleton process actor code; more functionality will be added in
   subsequent PRs.

NOTE: The code is protected under a SERVER_KNOB with the default
      value as 'false' for now.
2022-01-25 17:38:27 -08:00
Ata E Husain Bohra 703364d146
Update cluster recovery documentation (#6255)
Patch updates code documentation to reflect the recent code
refactoring where ClusterController process drives recovery
instead of sequencer/master process.
2022-01-18 13:54:00 -08:00
Ata E Husain Bohra 936bf5336a
Revert "Revert "Refactor: ClusterController driving cluster-recovery state machine" (#6191)
* Revert "Revert "Refactor: ClusterController driving cluster-recovery state machine""

Major changes includes:
1. Re-revert Sequencer refactor commits listed below (in listed order):
1.a. This reverts commit bb17e194d9.
1.b. This reverts commit d174bb2e06.
1.c. This reverts commit 30b05b469c.

2. Update Status.actor to track ClusterController interface to track
   recovery status.
3. Introduce a ServerKnob to define "cluster recovery trace event"
   prefix; for now keeping it as "Master", however, it should allow
   smooth transition to "Cluster" prefix as it seems more appropriate.
2022-01-06 12:15:51 -08:00
Josh Slocum bc69521a91 Several fixes with restarting BW/BM 2022-01-05 12:48:53 -06:00
Aaron Molitor 30b05b469c Revert "Refactor: ClusterController driving cluster-recovery state machine"
This reverts commit dfe9d184ff.
2021-12-24 11:25:51 -08:00
Aaron Molitor d174bb2e06 Revert "Refactor: ClusterController driving cluster-recovery state machine"
This reverts commit abd2959702.
2021-12-24 11:25:51 -08:00
Aaron Molitor bb17e194d9 Revert "Refactor: ClusterController driving cluster-recovery state machine"
This reverts commit 1520390bc5.
2021-12-24 11:25:51 -08:00
Ata E Husain Bohra 1520390bc5 Refactor: ClusterController driving cluster-recovery state machine
diff-1: Address Jingyu's review comments
 diff-2: Introduce ClusterRecovery actor to seperate out
         cluster recovery code

At present, cluster recovery process consists of following steps:
1. ClusterController clusterWatchDatabase actor recruits
   master/sequencer process.
2. Sequencer process implements the cluster recovery state machine,
   responsible to recruit all other processes as well restore the
   cluster state.

Patch proposes a scheme where the cluster recovery state machine
is implemented and driven by the ClusterController process instead
of the Sequencer process.

Advantages of the scheme could be:
1. Simplified design where ClusterController recruits "sequencer"
   process like other worker processes compared to current scheme
   where "sequencer" process gets special treatment. In newer scheme
   sequencer is responsible for maintaining/providing
   "committed version" (as expected).
2. ClusterController is responsible for worker processes recruitment,
   the sequencer though orchestrating the recovery state machine, it
   need to reachout to the ClusterController for recruiting worker
   processes etc.

NOTE:
Patch has moved the recovery state machine code from
'sequencer' -> 'cluster-controller' process, however, necessary
updates were done for both functionality as well as performance
improvement reasons.

Next Steps:
Cluster recovery documentation will be updated in near future.
2021-12-22 14:06:27 -08:00
Ata E Husain Bohra abd2959702 Refactor: ClusterController driving cluster-recovery state machine
diff-1: Address Jingyu's review comments

At present, cluster recovery process consists of following steps:
1. ClusterController clusterWatchDatabase actor recruits
   master/sequencer process.
2. Sequencer process implements the cluster recovery state machine,
   responsible to recruit all other processes as well restore the
   cluster state.

Patch proposes a scheme where the cluster recovery state machine
is implemented and driven by the ClusterController process instead
of the Sequencer process.

Advantages of the scheme could be:
1. Simplified design where ClusterController recruits "sequencer"
   process like other worker processes compared to current scheme
   where "sequencer" process gets special treatment. In newer scheme
   sequencer is responsible for maintaining/providing
   "committed version" (as expected).
2. ClusterController is responsible for worker processes recruitment,
   the sequencer though orchestrating the recovery state machine, it
   need to reachout to the ClusterController for recruiting worker
   processes etc.

NOTE:
Patch has moved the recovery state machine code from
'sequencer' -> 'cluster-controller' process, however, necessary
updates were done for both functionality as well as performance
improvement reasons.

Next Steps:
Cluster recovery documentation will be updated in near future.
2021-12-22 14:06:27 -08:00
Ata E Husain Bohra dfe9d184ff Refactor: ClusterController driving cluster-recovery state machine
At present, cluster recovery process consists of following steps:
1. ClusterController clusterWatchDatabase actor recruits
   master/sequencer process.
2. Sequencer process implements the cluster recovery state machine,
   responsible to recruit all other processes as well restore the
   cluster state.

Patch proposes a scheme where the cluster recovery state machine
is implemented and driven by the ClusterController process instead
of the Sequencer process.

Advantages of the scheme could be:
1. Simplified design where ClusterController recruits "sequencer"
   process like other worker processes compared to current scheme
   where "sequencer" process gets special treatment. In newer scheme
   sequencer is responsible for maintaining/providing
   "committed version" (as expected).
2. ClusterController is responsible for worker processes recruitment,
   the sequencer though orchestrating the recovery state machine, it
   need to reachout to the ClusterController for recruiting worker
   processes etc.

NOTE:
Patch has moved the recovery state machine code from
'sequencer' -> 'cluster-controller' process, however, necessary
updates were done for both functionality as well as performance
improvement reasons.

Next Steps:
Cluster recovery documentation will be updated in near future.
2021-12-22 14:06:27 -08:00
Suraj Gupta f68c1fd3c4 Use _sr and fix disabling case. 2021-12-10 14:00:34 -06:00
Suraj Gupta c89fd0c3a8 reset txn after getting 2021-12-10 14:00:34 -06:00
Suraj Gupta 968a4f9f50 Don't rely on database config to be updated. 2021-12-10 14:00:34 -06:00
Suraj Gupta 640cee2072 Start BM without config change if config is enabled. 2021-12-10 14:00:34 -06:00
Suraj Gupta d3fbad74a2 cleanup debugging 2021-12-10 14:00:34 -06:00
Suraj Gupta cb568bbd55 Add watch on config key. 2021-12-10 14:00:34 -06:00
Suraj Gupta fc3376fe8f Move client knob to database config for blob granules. 2021-12-10 14:00:34 -06:00
Evan Tschannen 130def7897 fix: make sure both conditions are true before better master exists is executed 2021-12-06 13:50:31 -08:00
Andrew Noyes def41697bf
Merge pull request #6083 from sfc-gh-tclinkenbeard/remove-temporaries
Avoid creating unnecessary temporary objects
2021-12-06 13:24:56 -08:00
Evan Tschannen 951bc4acd7 fix: do not call better master exists until the long lived stateless processes have settled into their desired locations 2021-12-06 13:12:27 -08:00
Zhe Wu dc88d3fa37 use CC_ENABLE_WORKER_HEALTH_MONITOR knob to guard remoteDCIsHealthy logic 2021-12-06 09:33:45 -08:00
Evan Tschannen a4fff19320 Revert "fix: long lived stateless processes need to be the last comparison criteria to avoid better master exists from changing behavior from the original recruitment"
This reverts commit b55b095ed0.
2021-12-05 20:26:13 -08:00
Evan Tschannen b55b095ed0 fix: long lived stateless processes need to be the last comparison criteria to avoid better master exists from changing behavior from the original recruitment 2021-12-05 20:18:03 -08:00
Evan Tschannen ff47013158 fix: check if the master has been killed while waiting for getNextBMEpoch 2021-12-04 17:08:23 -08:00
sfc-gh-tclinkenbeard d01a363e29 Avoid creating unnecessary temporary objects 2021-12-01 23:48:34 -08:00
Evan Tschannen 557186ed17
Merge pull request #5909 from sfc-gh-jfu/jfu-cc-request-dbinfo
Change dbinfo broadcast to be explicitly requested by the worker registration message
2021-11-16 15:01:42 -08:00
Evan Tschannen 964d0209ca
Merge pull request #5637 from sfc-gh-ljoswiak/features/data-loss-prevention
Data loss protection when joining new cluster
2021-11-15 15:26:32 -08:00
Vaidas Gasiunas 51b8ccf7d3 Merge remote-tracking branch 'apple/master' into notify-client-lib-changes 2021-11-10 18:40:34 +01:00
Lukas Joswiak 74cf64fe0f Sync cluster ID through ServerDBInfo 2021-11-09 12:29:48 -08:00
Lukas Joswiak 3e2c65bb11 Allow tlog to join another cluster but retain its data 2021-11-09 12:29:48 -08:00
Lukas Joswiak 30867750b5 Add protection against storage and tlog data deletion when joining a new cluster 2021-11-09 12:29:47 -08:00
Jon Fu 00f4bd8536 Check ccInterface against serverDbInfo's cc and make broadcast unconditional for first registration 2021-11-08 12:43:02 -05:00
Jon Fu 4e8625ccc0 retain old behaviour along with explicit request 2021-11-03 17:23:07 -04:00
Jon Fu 59f0a2c3e5 Change dbinfo broadcast to be explicitly requested by the worker registration message 2021-11-03 15:51:21 -04:00
Evan Tschannen ee00135a6b skip good recruitment errors when doing simulation only validation 2021-11-01 13:24:15 -07:00
Evan Tschannen 78e36e7590 fix: simulation only validation could throw errors which would impact the behavior of the cluster controller 2021-11-01 13:24:15 -07:00
Evan Tschannen ddf235713e strengthen assert 2021-10-28 16:40:30 -07:00
Evan Tschannen 4d8ee2ed33 fix: simple recruitment could succeed with less than the required replication factor 2021-10-28 16:38:04 -07:00
Vaidas Gasiunas 875824b186 MVC2.0: Notify clients about relevant changes of client libraries 2021-10-27 23:43:40 +02:00
Josh Slocum 0ff8ddc2b6 Merge branch 'master' into blob_full_clean 2021-10-25 13:38:48 -05:00