181 Commits

Author SHA1 Message Date
Jingyu Zhou 622520bd2d Return the source team if remote DC is dead
Also refactor the code with findTeamFromServers().
2023-02-10 11:11:07 -08:00
Jingyu Zhou 9aa15b459c Clean up trace events 2023-02-10 11:11:07 -08:00
Jingyu Zhou 6c4a9b5f23 Fix DD stuck when remote DC is dead
When the remote DC is down, the remote team collection of DD can still be
initializing, waiting for the remote to recover (all_tlog_recruited state).
However, the getTeam request can already be served by the remote team
collection. So, for a RelocateShard (data movement such as split, move), it
will get a team for the remote DC. But the data movement can't make progress on
the remote team because the remote DC hasn't recovered yet. Because data
movement is stuck, the primary cannot reach the "storage_recovered" state and
stays in the accepting_commit state.

The specific test failure: slow/ApiCorrectness.toml -s 339026305 -b on
at commit:  0edd899d65

In this test, the primary DC has 1 SS killed, and the remote DC has 2 TLogs and
2 SSes killed. So the remote is dead: the remaining 2 SSes can't make progress
because of the loss of the 2 TLogs. The repairDeadDatacenter() can't reach the
"storage_recovered" state because DD fails to move shards away from the killed
SS in the primary.

The fix is to exclude the entire remote DC in repairDeadDatacenter(), which
tells DD to mark all SSes in the remote as unhealthy. Another fix is to return
empty results for getTeam requests if the remote team collection is not ready.
This allows the data movement to continue; essentially, the remote team is not
changed for the data movement.
2023-02-10 11:11:07 -08:00
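The getTeam part of the fix in 6c4a9b5f23 can be illustrated with a minimal sketch. This is not the FoundationDB code: `TeamCollectionSketch`, its `ready` flag, and `chooseRemoteDestination()` are hypothetical names used only to show the idea that a team collection which is not ready returns no team, so the relocation keeps the remote team it already has.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a DD team collection; not the FoundationDB class.
struct TeamCollectionSketch {
    bool ready = false;              // assumed flag: false while the remote DC is still recovering
    std::vector<std::string> teams;  // candidate destination teams

    // Returns no team while the collection is not ready (or has no teams),
    // so the caller never targets a DC that cannot accept data yet.
    std::optional<std::string> getTeam() const {
        if (!ready || teams.empty()) {
            return std::nullopt;
        }
        return teams.front();
    }
};

// Picks the remote destination for a relocation: the newly chosen team when
// one is available, otherwise the team that already holds the data.
inline std::string chooseRemoteDestination(const TeamCollectionSketch& remote,
                                           const std::string& currentRemoteTeam) {
    return remote.getTeam().value_or(currentRemoteTeam);
}
```

With an empty result, the relocation can still proceed in the primary while the remote side is left unchanged.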
Yi Wu d3bc2afc8e
EaR: storage server uses encryption DB config (#9115)
This PR updates the storage server and Redwood to enable encryption based on the encryption mode in the DB config, which was previously controlled by a knob. High-level changes:
1. Passing the encryption mode in the DB config to the storage server
    1.1 If it is a new storage server, pass the encryption mode through `InitializeStorageRequest`. The encryption mode is then passed to Redwood for initialization
    1.2 If it is an existing storage server, on restart the storage server will send `GetStorageServerRejoinInfoRequest` to the commit proxy, and the commit proxy will return the current encryption mode, which it gets from the DB config on its own initialization. The storage server will compare the DB config encryption mode to the local storage encryption mode, and fail if they don't match
2. Adding a new `encryptionMode()` method to `IKeyValueStore`, which returns a future of the local encryption mode of the KV store instance. A KV store supporting encryption needs to persist its own encryption mode and return the mode via the API.
3. Redwood accepts the encryption mode through its constructor. For a new Redwood instance, the caller has to specify the encryption mode, which will be stored in Redwood's per-instance file header. For an existing instance, the caller should not pass the encryption mode and should let Redwood read it from its own header.
4. Refactoring in Redwood to accommodate the above changes.
2023-02-06 14:02:31 -08:00
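The new-versus-existing storage server rule described in d3bc2afc8e can be sketched as follows. This is only an illustration under stated assumptions: the `EncryptionMode` enum values, `resolveEncryptionMode()`, and the use of an exception to signal the mismatch are hypothetical and do not reflect the actual FoundationDB types or error handling.

```cpp
#include <optional>
#include <stdexcept>

// Hypothetical two-valued mode, for illustration only.
enum class EncryptionMode { Disabled, Enabled };

// dbConfigMode:  the mode taken from the DB config.
// persistedMode: the mode found in the local store's header; empty for a
//                brand-new store that has nothing persisted yet.
inline EncryptionMode resolveEncryptionMode(EncryptionMode dbConfigMode,
                                            std::optional<EncryptionMode> persistedMode) {
    if (!persistedMode) {
        // New store: adopt the DB config mode (it would then be persisted).
        return dbConfigMode;
    }
    if (*persistedMode != dbConfigMode) {
        // Existing store: a mismatch with the DB config is a failure.
        throw std::runtime_error("encryption mode mismatch");
    }
    return *persistedMode;
}
```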
Xiaoxi Wang bbcb3cc018 extract KeyBackedConfig, StorageWiggleData class; solve template resolution problem; solve MV txn and native api conflict by splitting RunTransaction file 2023-01-02 23:34:39 -08:00
Xiaoxi Wang f13453fe63 Merge branch 'main' of https://github.com/apple/foundationdb into feature/main/wiggleDelay 2022-12-20 17:21:19 -08:00
Meng Xu e6b2254726 Resolve review comments: No functional change 2022-12-19 15:28:01 -08:00
Meng Xu a1d513b355 Fix: Exclusion stuck because DD cannot build new teams
Bug behavior:
When DD has zero healthy machine teams but more unhealthy machine teams
than the max machine teams DD plans to build, DD will stop building
new machine teams. Due to zero healthy machine teams (and zero healthy
server teams), DD cannot find a healthy destination team to relocate data.
When data relocation stops, exclusion stops progressing and gets stuck.

The bug happens when we *shrink* a k-host cluster by
first adding k/2 new hosts;
then quickly excluding all old hosts.

Fix:
Let DD build temporary extra teams to relocate data.
The extra teams will be cleaned up later by DD's remove extra teams logic.

Simulation test:
There is no simulation test to cover the cluster expansion scenario.
To most closely simulate this behavior, we intentionally overbuild all possible
machine teams to trigger the condition that the number of unhealthy teams is
larger than the maximum number of teams DD wants to build later.
2022-12-19 15:28:01 -08:00
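The team-building condition described in a1d513b355 can be sketched roughly as below. The function name, its parameters, and the choice to build exactly one temporary team are assumptions made for illustration; the commit message does not say how many extra teams DD builds.

```cpp
// Hypothetical sketch of the decision, not the DD implementation.
// healthyTeams: machine teams that are currently healthy
// totalTeams:   all existing machine teams, healthy or not
// maxTeams:     the maximum number of machine teams DD plans to build
inline int machineTeamsToBuild(int healthyTeams, int totalTeams, int maxTeams) {
    int remainingBudget = maxTeams - totalTeams;
    if (remainingBudget > 0) {
        // Normal case: still under the limit, so build up to the limit.
        return remainingBudget;
    }
    if (healthyTeams == 0) {
        // Bug scenario: unhealthy teams alone exhaust the limit. Build a
        // temporary extra team (count assumed) so data has a destination;
        // extra teams are expected to be removed later.
        return 1;
    }
    return 0;
}
```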
Xiaoxi Wang a33b366f19 merge feature/main/ppwLoadBalance 2022-12-15 13:27:44 -08:00
Xiaoxi Wang 919c512cdc fix wiggler state setting 2022-12-15 12:14:40 -08:00
Xiaoxi Wang ab4778bd19 Merge branch 'main' of https://github.com/apple/foundationdb into feature/main/ppwLoadBalance 2022-12-15 11:36:20 -08:00
Xiaoxi Wang c12de23824 Merge branch 'main' of https://github.com/apple/foundationdb into feature/main/wiggleDelay 2022-12-11 14:27:22 -08:00
Xiaoxi Wang 3e4966d5bd persistent perpetual wiggle delay 2022-12-08 23:46:26 -05:00
sfc-gh-tclinkenbeard 68f14f017c Fix clang 15 compiler warnings 2022-12-08 13:59:37 -08:00
Xiaoxi Wang ccc494319c perpetual wiggle key functions 2022-12-08 16:46:05 -05:00
FoundationDB CI 86d6106dc1
format source code after switch to clang 15 2022-12-08 17:26:45 +00:00
Xiaoxi Wang 16d11143fa add smallLoadThreshold logic and change knobs 2022-12-07 11:45:49 -05:00
Xiaoxi Wang aae89c863d DDTeamCollection.getAverageShardBytes 2022-12-07 10:08:22 -05:00
Xiaoxi Wang 5d01d33531 Merge branch 'main' of https://github.com/apple/foundationdb into feature/main/ppwLoadBalance 2022-12-07 09:11:55 -05:00
Xiaoxi Wang 73a72d70fd consider the overall load in the cluster 2022-12-07 08:58:52 -05:00
Xiaoxi Wang c89d74fa1b rewrite loadBytesBalanceRatio; rename knobs; update comments 2022-11-16 12:52:25 -08:00
Xiaoxi Wang ac923cfbcd add knobs; make ppw wait for byte load balance 2022-11-10 12:25:51 -08:00
Xiaoxi Wang 7a5f2973c5 move stopWiggleSignal to StorageWiggler; update workload 2022-11-08 23:02:35 -08:00
Xiaoxi Wang 4727449ef0 Merge branch 'main' of https://github.com/apple/foundationdb into fix/main/restoreStats 2022-11-08 15:35:15 -08:00
Xiaoxi Wang 4971976a61 make trackExcludedServers PRIORITY_SYSTEM_IMMEDIATE 2022-11-07 14:38:04 -08:00
Zhe Wu 56001de2d4 More nit changes around DD 2022-11-07 09:11:16 -08:00
Zhe Wu 3a02f919b9 Add some comments in DD and fix some nit 2022-11-07 09:11:16 -08:00
Xiaoxi Wang 03a9dd009a fix compilation errors 2022-11-06 22:46:54 -08:00
Xiaoxi Wang 8e6a9730ea move StorageWiggleMetrics out; add workload; try to fix the restore/reset bug (not tested) 2022-11-03 23:42:44 -07:00
Jingyu Zhou c127bb1c30 Fix some clang warnings on unused variables 2022-11-01 15:38:47 -07:00
Lukas Joswiak 9d3c3b1efe Remove cluster ID logic from individual roles
The logic to determine the validity of a process joining a cluster now
belongs on the worker and the cluster controller. It is no longer
restricted to tlogs and storages, but instead applies to all processes
(even stateless ones).
2022-10-27 13:56:13 -07:00
Xiaoxi Wang 5d90703dc8 finish getKeysLocations etc, and unit test pass. 2022-10-24 09:58:41 -07:00
Jingyu Zhou a8391caf23 Revert "Data loss protection v2" 2022-10-20 18:09:58 -05:00
Lukas Joswiak 72bc89cf39 Remove cluster ID logic from individual roles
The logic to determine the validity of a process joining a cluster now
belongs on the worker and the cluster controller. It is no longer
restricted to tlogs and storages, but instead applies to all processes
(even stateless ones).
2022-10-18 21:37:42 -07:00
Jingyu Zhou 63f5705560
Merge pull request #8414 from sfc-gh-xwang/feature/main/txnProcessor_team
Replace self->cx with self->dbContext() method and trivial renaming
2022-10-10 10:09:36 -07:00
Markus Pilman ea1325a552
Merge pull request #8319 from sfc-gh-tclinkenbeard/add-rare-code-probe-annotation
Add `rare` code probe decoration
2022-10-07 09:39:00 -06:00
Xiaoxi Wang 2ad4f29539 dbContext() replace self->cx, remove cx member 2022-10-04 16:39:22 -07:00
Xiaoxi Wang 21b2e11bc4 getWorkers from IDDTxnProcessor 2022-10-04 14:57:04 -07:00
Xiaoxi Wang 28e170ca69 remove unnecessary Database cx parameters 2022-10-04 14:29:45 -07:00
Xiaoxi Wang 4cf4ccc089 correct getServerListAndProcessClasses implementation (100k pass) 2022-10-03 22:24:35 -07:00
Xiaoxi Wang 76f2dc8ce0 merge upstream/main 2022-10-02 22:07:42 -07:00
Xiaoxi Wang df9b21169d change shared_ptr to Reference 2022-09-27 11:22:47 -07:00
Xiaoxi Wang 14d73193d5 waitDDTeamInfoPrintSignal, getClusterId, tryUpdateReplicasKeyForDc in IDDTxnProcessor 2022-09-26 23:00:31 -07:00
sfc-gh-tclinkenbeard 985958c260 Add rare code probe decoration 2022-09-25 15:28:32 -07:00
Xiaoxi Wang 1194774d54 rename dbProcessor to db; rename getDb() to context() 2022-09-23 15:35:39 -07:00
Xiaoxi Wang 11a6cba2c6 rename dbProcessor to db; readability improvement 2022-09-22 17:11:07 -07:00
Xiaoxi Wang e7a280ec03 format code 2022-09-21 20:49:39 -07:00
Xiaoxi Wang 97fd5878d9 change DDTeamCollection constructor 2022-09-20 13:00:28 -07:00
A.J. Beamon 4fd64630e8 Convert literal string ref instances to use _sr suffix 2022-09-19 11:35:58 -07:00
Xiaoxi Wang dab8bcd109 Merge branch 'main' of https://github.com/apple/foundationdb into feature/main/wiggler-tss 2022-08-12 15:27:50 -07:00