foundationdb

Commit Graph

Author	SHA1	Message	Date
Jingyu Zhou	622520bd2d	Return the source team if remote DC is dead Also refactor the code with findTeamFromServers().	2023-02-10 11:11:07 -08:00
Jingyu Zhou	9aa15b459c	Clean up trace events	2023-02-10 11:11:07 -08:00
Jingyu Zhou	6c4a9b5f23	Fix DD stuck when remote DC is dead When remote DC is down, the remote team collection of DD can initializing waiting for the remote to recover (all_tlog_recruited state). However, the getTeam request can already be served by the remote team collection. So, for a RelocateShard (data movement such as split, move), it will get a team for the remote DC. But the data movement can't make progress on the remote team because the remote DC hasn't recovered yet. Because of the stuck of data movement, the primary cannot reach the "storage_recovered" state and stay in accepting_commit state. The specifc test failure: slow/ApiCorrectness.toml -s 339026305 -b on at commit: `0edd899d65` In this test, primary DC has 1 SS killed, remote DC has 2 TLog and 2 SS killed. So the remote is dead, the remaining 2 SSes can't make progress because of the loss of 2 TLogs. The repairDeadDatacenter() can't reach the "storage_recovered" state due to DD's failure of moving shards away from the killed SS in the primary. The fix is to exclude all remote in repairDeadDatacenter() so that tells DD to mark all SSes in the remote as unhealthy. Another fix is to return empty results for getTeam request if the remote team collection is not ready. This will allow the data movement to continue, essentially remote team is not changed for the data movement.	2023-02-10 11:11:07 -08:00
Yi Wu	d3bc2afc8e	EaR: storage server uses encryption DB config (#9115 ) The PR is updating storage server and Redwood to enable encryption based on the encryption mode in DB config, which was previously controlled by a knob. High level changes are 1. Passing encryption mode in DB config to storage server 1.1 If it is a new storage server, pass the encryption mode through `InitializeStorageRequest`. the encryption mode is pass to Redwood for initialization 1.2 If it is an existing storage server, on restart the storage server will send `GetStorageServerRejoinInfoRequest` to commit proxy, and commit proxy will return the current encryption mode, which it get from DB config on its own initialization. Storage server will compare the DB config encryption mode to the local storage encryption mode, and fail if they don't match 2. Adding a new `encryptionMode()` method to `IKeyValueStore`, which return a future of local encryption mode of the KV store instance. A KV store supporting encryption would need to persist its own encryption mode, and return the mode via the API. 3. Redwood accepts encryption mode from its constructor. For a new Redwood instance, caller has to specific the encryption mode, which will be stored in Redwood per-instance file header. For existing instance, caller is supposed to not passing the encryption mode, and let Redwood find it out from its own header. 4. Refactoring in Redwood to accommodate the above changes.	2023-02-06 14:02:31 -08:00
Xiaoxi Wang	bbcb3cc018	extract KeyBackedConfig, StorageWiggleData class; solve template resolution problem; solve MV txn and native api conflict by splitting RunTransaction file	2023-01-02 23:34:39 -08:00
Xiaoxi Wang	f13453fe63	Merge branch 'main' of https://github.com/apple/foundationdb into feature/main/wiggleDelay	2022-12-20 17:21:19 -08:00
Meng Xu	e6b2254726	Resolve review comments: No functional change	2022-12-19 15:28:01 -08:00
Meng Xu	a1d513b355	Fix:Exclusion stuck because DD cannot build new teams Bug behavior: When DD has zero healthy machine teams but more unhealthy machine teams than the max machine teams DD plans to build, DD will stop building new machine teams. Due to zero healthy machine team (and zero healthy server team), DD cannot find a healthy destination team to relocate data. When data relocation stops, exclusion stops progressing and stuck. Bug happens when we shrink a k-host cluster by first adding k/2 new host; then quickly excluding all old hosts. Fix: Let DD build temporary extra teams to relocate data. The extra teams will be cleaned up later by DD's remove extra teams logic. Simulation test: There is no simulation test to cover cluster expansion scnenario. To most closely simulate this behavior, we intentionally overbuild all possible machine teams to trigger the condition that unhealthy teams is larger than the maximum teams DD wants to build later.	2022-12-19 15:28:01 -08:00
Xiaoxi Wang	a33b366f19	merge feature/main/ppwLoadBalance	2022-12-15 13:27:44 -08:00
Xiaoxi Wang	919c512cdc	fix wiggler state setting	2022-12-15 12:14:40 -08:00
Xiaoxi Wang	ab4778bd19	Merge branch 'main' of https://github.com/apple/foundationdb into feature/main/ppwLoadBalance	2022-12-15 11:36:20 -08:00
Xiaoxi Wang	c12de23824	Merge branch 'main' of https://github.com/apple/foundationdb into feature/main/wiggleDelay	2022-12-11 14:27:22 -08:00
Xiaoxi Wang	3e4966d5bd	persistent perpetual wiggle delay	2022-12-08 23:46:26 -05:00
sfc-gh-tclinkenbeard	68f14f017c	Fix clang 15 compiler warnings	2022-12-08 13:59:37 -08:00
Xiaoxi Wang	ccc494319c	perpetual wiggle key functions	2022-12-08 16:46:05 -05:00
FoundationDB CI	86d6106dc1	format source code after switch to clang 15	2022-12-08 17:26:45 +00:00
Xiaoxi Wang	16d11143fa	add smallLoadThreshold logic and change knobs	2022-12-07 11:45:49 -05:00
Xiaoxi Wang	aae89c863d	DDTeamCollection.getAverageShardBytes	2022-12-07 10:08:22 -05:00
Xiaoxi Wang	5d01d33531	Merge branch 'main' of https://github.com/apple/foundationdb into feature/main/ppwLoadBalance	2022-12-07 09:11:55 -05:00
Xiaoxi Wang	73a72d70fd	consider the overall load in the cluster	2022-12-07 08:58:52 -05:00
Xiaoxi Wang	c89d74fa1b	rewrite loadBytesBalanceRatio; rename knobs; update comments	2022-11-16 12:52:25 -08:00
Xiaoxi Wang	ac923cfbcd	add knobs; make ppw wait for byte load balance	2022-11-10 12:25:51 -08:00
Xiaoxi Wang	7a5f2973c5	move stopWiggleSignal to StorageWiggler; update workload	2022-11-08 23:02:35 -08:00
Xiaoxi Wang	4727449ef0	Merge branch 'main' of https://github.com/apple/foundationdb into fix/main/restoreStats	2022-11-08 15:35:15 -08:00
Xiaoxi Wang	4971976a61	make trackExcludedServers PRIORITY_SYSTEM_IMMEDIATE	2022-11-07 14:38:04 -08:00
Zhe Wu	56001de2d4	More nit changes around DD	2022-11-07 09:11:16 -08:00
Zhe Wu	3a02f919b9	Add some comments in DD and fix some nit	2022-11-07 09:11:16 -08:00
Xiaoxi Wang	03a9dd009a	fix compilation errors	2022-11-06 22:46:54 -08:00
Xiaoxi Wang	8e6a9730ea	move StorageWiggleMetrics out; add workload; try to fix the restore/reset bug (not test)	2022-11-03 23:42:44 -07:00
Jingyu Zhou	c127bb1c30	Fix some clang warnings on unused variables	2022-11-01 15:38:47 -07:00
Lukas Joswiak	9d3c3b1efe	Remove cluster ID logic from individual roles The logic to determine the validity of a process joining a cluster now belongs on the worker and the cluster controller. It is no longer restricted to tlogs and storages, but instead applies to all processes (even stateless ones).	2022-10-27 13:56:13 -07:00
Xiaoxi Wang	5d90703dc8	finish getKeysLocations etc, and unit test pass.	2022-10-24 09:58:41 -07:00
Jingyu Zhou	a8391caf23	Revert "Data loss protection v2"	2022-10-20 18:09:58 -05:00
Lukas Joswiak	72bc89cf39	Remove cluster ID logic from individual roles The logic to determine the validity of a process joining a cluster now belongs on the worker and the cluster controller. It is no longer restricted to tlogs and storages, but instead applies to all processes (even stateless ones).	2022-10-18 21:37:42 -07:00
Jingyu Zhou	63f5705560	Merge pull request #8414 from sfc-gh-xwang/feature/main/txnProcessor_team Replace self->cx with self->dbContext() method and trivial renaming	2022-10-10 10:09:36 -07:00
Markus Pilman	ea1325a552	Merge pull request #8319 from sfc-gh-tclinkenbeard/add-rare-code-probe-annotation Add `rare` code probe decoration	2022-10-07 09:39:00 -06:00
Xiaoxi Wang	2ad4f29539	dbContext() replace self->cx, remove cx member	2022-10-04 16:39:22 -07:00
Xiaoxi Wang	21b2e11bc4	getWorkers from IDDTxnProcessor	2022-10-04 14:57:04 -07:00
Xiaoxi Wang	28e170ca69	remove unnecessary Database cx parameters	2022-10-04 14:29:45 -07:00
Xiaoxi Wang	4cf4ccc089	correct getServerListAndProcessClasses implementation (100k pass)	2022-10-03 22:24:35 -07:00
Xiaoxi Wang	76f2dc8ce0	merge upstream/main	2022-10-02 22:07:42 -07:00
Xiaoxi Wang	df9b21169d	change shared_ptr to Reference	2022-09-27 11:22:47 -07:00
Xiaoxi Wang	14d73193d5	waitDDTeamInfoPrintSignal, getClusterId, tryUpdateReplicasKeyForDc in IDDTxnProcessor	2022-09-26 23:00:31 -07:00
sfc-gh-tclinkenbeard	985958c260	Add rare code probe decoration	2022-09-25 15:28:32 -07:00
Xiaoxi Wang	1194774d54	rename dbProcessor to db; rename getDb() to context()	2022-09-23 15:35:39 -07:00
Xiaoxi Wang	11a6cba2c6	rename dbProcessor to db; readability improvement	2022-09-22 17:11:07 -07:00
Xiaoxi Wang	e7a280ec03	format code	2022-09-21 20:49:39 -07:00
Xiaoxi Wang	97fd5878d9	change DDTeamCollection constructor	2022-09-20 13:00:28 -07:00
A.J. Beamon	4fd64630e8	Convert literal string ref instances to use _sr suffix	2022-09-19 11:35:58 -07:00
Xiaoxi Wang	dab8bcd109	Merge branch 'main' of https://github.com/apple/foundationdb into feature/main/wiggler-tss	2022-08-12 15:27:50 -07:00

1 2 3 4

181 Commits