Commit Graph

879 Commits

Author SHA1 Message Date
Xiaoxi Wang 13bbd062c4 add storage wiggler state 2022-05-08 22:06:11 -07:00
Xiaoxi Wang a3d0b005dc reset several method use getShardMetrics 2022-05-04 00:00:03 -07:00
Xiaoxi Wang 1723bee639 add fetchTopKShardMetrics to dd tracker 2022-05-03 23:42:09 -07:00
Xiaoxi Wang 940512f208 simplify the storageWiggler ordering; add unittests 2022-05-02 12:25:02 -07:00
Xiaoxi Wang 99d2335220 add storeType to metadata; updateStorageMetadata; combine storeTypeTracker 2022-04-23 00:03:57 -07:00
Xiaoxi Wang ed97a35dc0 Merge branch 'main' into readaware 2022-04-12 16:47:15 -07:00
sfc-gh-tclinkenbeard 41c3bb03c3 Fix storageFaultTolerance calculation 2022-04-08 11:35:24 -07:00
sfc-gh-tclinkenbeard 3fcaf4dda3 Account for storage team size when computing storageFaultTolerance 2022-04-08 10:39:44 -07:00
sfc-gh-tclinkenbeard e27b0d9ab5 Merge remote-tracking branch 'origin/main' into improve-snapshot-fault-tolerance 2022-04-07 23:30:16 -07:00
sfc-gh-tclinkenbeard f4a988fe36 Remove unnecessary call to getDatabaseConfiguration in ddSnapCreateCore 2022-04-07 23:27:59 -07:00
sfc-gh-tclinkenbeard 91930b8040 Remove getMinReplicasRemaining PromiseStream.
Instead, in order to enforce the maximum fault tolerance for snapshots,
update getStorageWorkers to return the number of unavailable storage
servers (instead of throwing an error when unavailable storage servers
exist).
2022-04-07 23:23:23 -07:00
Xiaoxi Wang aba9d85560 merge main 2022-04-06 09:57:52 -07:00
sfc-gh-tclinkenbeard 70f378bacc Restrict write access to getUnhealthyRelocationCount 2022-04-03 23:47:54 -07:00
sfc-gh-tclinkenbeard 33fb6ab983 Prevent coordFaultTolerance from dropping below 0 2022-04-03 23:37:42 -07:00
sfc-gh-tclinkenbeard 4f61c86b69 Add MAX_COORDINATOR_SNAPSHOT_FAULT_TOLERANCE knob 2022-04-03 23:28:57 -07:00
sfc-gh-tclinkenbeard 253db642be Add MAX_SNAPSHOT_FAULT_TOLERANCE knob 2022-04-03 22:31:45 -07:00
Steve Atherton 6744e9e4f9 Change timestamps used in storage server metadata and perpetual wiggle metrics to epoch seconds, stored as doubles, and stringified as either floating point epoch seconds or timestamp strings of the form "2013-04-28 20:57:01.000 +0000". 2022-03-30 18:57:06 -07:00
Steve Atherton d6e2d2a1fe Fix nondeterminism in StorageWiggleMetrics caused by use of timer_int(). 2022-03-30 14:48:01 -07:00
Xiaoxi Wang d93b57dd88 conflict solving 2022-03-24 20:45:51 -07:00
Xiaoxi Wang 1b631a9263 solve conflict with main 2022-03-24 16:29:11 -07:00
sfc-gh-tclinkenbeard a71099471b Update copyright header dates 2022-03-21 13:36:23 -07:00
Xiaoxi Wang 4b92f8f546 add relocate reason and set teamSorter in relocator 2022-03-18 16:39:31 -07:00
sfc-gh-tclinkenbeard 58de6e22cc Add BalanceOnRequests boolean parameter for ModelInterface 2022-03-16 14:25:32 -07:00
Xiaoge Su cbd381778e Fix the includes in DataDistribution.actor.cpp
Update the comment to re-trigger failed checks
2022-03-15 18:05:55 -07:00
A.J. Beamon 250a88e682 Enforce that trace event suppression calls happen first when using trace event call chaining. Fix various instances where we weren't following this requirement. 2022-02-24 12:25:52 -08:00
Bharadwaj V.R 36c5d3a1e6 Merge branch 'main' into dd-utest 2022-02-11 12:25:31 -08:00
sfc-gh-tclinkenbeard 9158564bfc Fix formatting 2022-02-11 10:27:41 -08:00
Bharadwaj V.R 41bd39a82a Fix code formatting 2022-02-10 22:10:50 -08:00
Bharadwaj V.R b306288c62 Merge branch 'main' into dd-utest 2022-02-10 22:00:52 -08:00
sfc-gh-tclinkenbeard 2165635478 Make printSnapshotTeamsInfo a static function of DDTeamCollection 2022-02-10 18:45:52 -08:00
sfc-gh-tclinkenbeard 9bc38ae73e Make DDTeamCollection::distributorId private 2022-02-10 18:26:06 -08:00
sfc-gh-tclinkenbeard 14c8483e9d Mark DDTeamCollection::primary private 2022-02-10 18:16:57 -08:00
sfc-gh-tclinkenbeard 641a38bd0b Make more DDTeamCollection methods private.
The methods only used by DDTeamCollection::run can now be made private.
2022-02-10 16:19:32 -08:00
sfc-gh-tclinkenbeard c4508330d2 Make dataDistributionTeamCollection a static function of DDTeamCollection 2022-02-10 16:19:32 -08:00
sfc-gh-tclinkenbeard 5477012ad8 Change DDTeamCollection method signatures to accept references.
Passing nullptr to these methods is invalid, but previously the
signature didn't indicate this. We previously needed to pass pointers
due to actor compiler restrictions, but these restrictions no longer
apply.
2022-02-10 16:19:32 -08:00
sfc-gh-tclinkenbeard 3141698c41 Use special ASSERT_* macros for numeric comparison in data distribution
code.

This helps debugging by printing the exact input values when an
assertion fails.
2022-02-10 11:59:19 -08:00
sfc-gh-tclinkenbeard 975b9f3b32 Remove get helper function from DataDistribution.actor.cpp 2022-02-10 11:32:33 -08:00
Bharadwaj V.R 6d46b03651 Add some unit tests for DD team selection 2022-02-09 22:22:56 -08:00
sfc-gh-tclinkenbeard 04a1347df2 Merge remote-tracking branch 'origin/main' into dd-refactor 2022-02-08 00:33:27 -08:00
Xiaoxi Wang 6dc5921575
createdTime based storage wiggler (#6219)
* add storagemetadata

* add StorageWiggler;

* fix serverMetadataKey bug

* add metadata tracker in storage tracker

* finish StorageWiggler

* update next storage ID

* change pid to server id

* write metadata when seed SS

* add status json fields

* remove pid based ppw iteration

* fix time expression

* fix tss metadata nonexistence; fix transaction retry when retrieving metadata

* fix checkMetadata bug when store type is wrong

* fix remove storage status json

* format code

* refactor updateNextWigglingStoragePID

* separate storage metadata tracker and store type tracker

* rename pid

* wiggler stats

* fix completion between waitServerListChange and storageRecruiter

* solve review comments

* rename system key

* fix database lock timeout by adding lock_aware

* format code

* status json

* resolve code format/naming comments

* delete expireNow; change PerpetualStorageWiggleID's value to KeyBackedObjectMap<UID, StorageWiggleValue>

* fix omit start round

* format code

* status json reset

* solve status json format

* improve status json latency; replace binarywriter/reader to objectwriter/reader; refactor storagewigglerstats transactions

* status timestamp
2022-02-04 15:04:30 -08:00
sfc-gh-tclinkenbeard 68ec591cf9 Move DDTeamCollection into its own files 2022-02-04 00:39:42 -08:00
Ata E Husain Bohra 703364d146
Update cluster recovery documentation (#6255)
Patch updates code documentation to reflect the recent code
refactoring where ClusterController process drives recovery
instead of sequencer/master process.
2022-01-18 13:54:00 -08:00
sfc-gh-tclinkenbeard 90ced244eb Fix -Wunused-but-set-variable warnings 2021-12-01 18:15:53 -08:00
Josh Slocum 1870e07ff4 Fixed pause racing with waitUntilHealthy 2021-11-29 14:19:15 -06:00
Evan Tschannen 964d0209ca
Merge pull request #5637 from sfc-gh-ljoswiak/features/data-loss-prevention
Data loss protection when joining new cluster
2021-11-15 15:26:32 -08:00
Ata E Husain Bohra 82c3e8bf79
Trigger buildTeam operation if server transition from unhealthy -> healthy (#5930)
* Trigger buildTeam operation if server transition from unhealthy -> healthy

The DataDistribution actor helps build teams as the server count changes
(addition/removal); however, the total_healthy_server count may be
insufficient to allow team formation. If that happens, the buildTeam
operation is not triggered even after the healthy server count recovers.

The patch triggers the `checkBuildTeam` operation when a server
transitions from the unhealthy to the healthy state. In case the system
has already created enough teams (desiredTeamCount/maxTeamCount), the
operation incurs minimal cost.
2021-11-12 09:41:01 -08:00
Lukas Joswiak 15e0d5b29f Add explicit transaction options when reading cluster ID 2021-11-09 12:29:49 -08:00
Lukas Joswiak 3988b11fd6 Cleanup 2021-11-09 12:29:48 -08:00
Lukas Joswiak 30867750b5 Add protection against storage and tlog data deletion when joining a new cluster 2021-11-09 12:29:47 -08:00
sfc-gh-tclinkenbeard 30cef51746 Improve tracing in ddSnapCreateCore 2021-11-04 12:59:50 -07:00
sfc-gh-tclinkenbeard d0c9cf4fb0 Enable mismatched-tags clang warning 2021-11-01 14:18:31 -07:00
Xiaoxi Wang e4fd0023b7 don't disable machine team remover 2021-10-27 09:08:37 -07:00
Xiaoxi Wang 75ef854563 format 2021-10-27 09:08:37 -07:00
Xiaoxi Wang db7ee9d389 disable team remover 2021-10-27 09:08:37 -07:00
Xiaoxi Wang 14fa32f208 change boolean 2021-10-27 09:08:37 -07:00
Xiaoxi Wang 1a2a838df3 add knob 2021-10-27 09:08:37 -07:00
Xiaoxi Wang c320391c4c restartRecruiting 2021-10-27 09:08:37 -07:00
Xiaoxi Wang dc630d63bd add asyncvar 2021-10-27 09:08:37 -07:00
Xiaoxi Wang 654c0a1f14 format 2021-10-27 09:08:37 -07:00
Xiaoxi Wang 8a10966126 wait extra time 2021-10-27 09:08:37 -07:00
Xiaoxi Wang d1959122af consider wiggling when waitUntilHealthy 2021-10-27 09:08:37 -07:00
Xiaoxi Wang 69190ed04e format 2021-10-27 09:08:37 -07:00
Xiaoxi Wang 0053b4793e change knob and delete redundant doBuildTeam 2021-10-27 09:08:37 -07:00
Xiaoxi Wang db7b48b71c wiggling teams calculation replace 2021-10-27 09:08:37 -07:00
Xiaoxi Wang 3a6359e202 minus wiggling teams when build team 2021-10-27 09:08:37 -07:00
He Liu 16ae2b76e5 Merge branch 'master' of https://github.com/apple/foundationdb into clean-sim-test-data-loss 2021-10-21 09:16:53 -07:00
Trevor Clinkenbeard 504d0b71b2
Fix invalid memory access when dataDistribution actor is cancelled (#5791)
* Fix valgrind error when dataDistribution actor is cancelled

* Trace Sev30 when dataDistribution actor is cancelled outside of simulation

* Rethrow actor_cancelled error in dataDistribution catch block
2021-10-18 14:21:29 -07:00
He Liu dbfeb06c97 Reproduced user data loss incident, and tested the improved exclude tool
can fix the system metadata.
2021-10-14 14:08:39 -07:00
Steve Atherton f339b603a5 Bug fix: printSnapshotTeamsInfo() could crash when looking up status for a storage server that was very recently added because its entry in server_status was not yet created.
Bug fix: printSnapshotTeamsInfo()'s local server_status map would not see status updates for server UIDs that already existed in the map.
2021-10-10 01:48:31 -07:00
Neethu Haneesha Bingi 3ea7209013 Simulation changes to randomly wiggle with locality filter and review comments. 2021-09-30 10:00:33 -07:00
Neethu Haneesha Bingi 3e79299898 Locality filter support to perpetual storage wiggler feature. 2021-09-30 10:00:33 -07:00
Chang Liu f50a3f08de Fix format problem for file fdbserver/DataDistribution.actor.cpp.
Description

Testing
2021-09-24 15:40:17 -07:00
Chang Liu 8761960cdc fix roll trace event issue for data distribution(master)
Description

Testing
2021-09-24 15:40:17 -07:00
Chang Liu 731c1fffac fix roll trace event issue for data distribution(master)
Description

Testing
2021-09-24 15:40:17 -07:00
Chang Liu 814e3c729f fix roll trace event issue for data distribution(master)
Description

Testing
2021-09-24 15:40:17 -07:00
Chang Liu c10dd0df4b fix roll trace event issue for data distribution
Description

Testing
2021-09-24 15:40:17 -07:00
He Liu a1f6dcc2d5 merge from master 2021-09-22 13:07:38 -07:00
Xiaoxi Wang 1730d75f73 change configure test
add store type check
add test file
2021-09-21 18:11:04 -07:00
He Liu 2246a0bee7 switch to plain random selection 2021-09-20 11:23:54 -07:00
He Liu 4d5bf08da8 address comments 2021-09-19 17:03:06 -07:00
Xiaoge Su abf73047ca Enforce std:: specifier rather than using namespace 2021-09-16 19:40:28 -07:00
He Liu ef7fdc0781 fmt 2021-09-15 10:32:09 -07:00
He Liu c8a3413820 exclude to-be-dropped server from the random team 2021-09-15 09:07:50 -07:00
helium fd6d088945 choose team before removing server 2021-09-14 19:24:59 -07:00
helium 7e53f8662d added comments 2021-09-13 13:28:55 -07:00
helium 8e0b572a18 Added comments. 2021-09-02 22:29:07 -07:00
helium 6612cc00b6 Check if the src server list will be empty before removing a failed server. 2021-09-01 14:52:07 -07:00
Trevor Clinkenbeard 66df75c570
Merge pull request #5385 from sfc-gh-tclinkenbeard/debug-dd
Capture deep copy of `machine_info` in `printSnapshotTeamsInfo`
2021-08-20 13:25:50 -07:00
sfc-gh-tclinkenbeard 9458a6975d Remove std::map::at usage from DataDistribution.actor.cpp 2021-08-20 12:35:26 -07:00
sfc-gh-tclinkenbeard 556e4bc283 Add assertion to overlappingMachineMembers 2021-08-13 14:56:22 -07:00
sfc-gh-tclinkenbeard a0a4207ce2 Capture deep copy of machine_info in printSnapshotTeamsInfo 2021-08-13 12:13:54 -07:00
sfc-gh-tclinkenbeard 3f0e07d79c Remove dead code 2021-08-13 09:54:02 -07:00
sfc-gh-tclinkenbeard ea4f6850da Mark ServerStatus::excludeOnRecruit const 2021-08-13 09:52:07 -07:00
sfc-gh-tclinkenbeard 0aafd9d5f0 Mark TCServerInfo::isCorrectStoreType const 2021-08-13 09:49:22 -07:00
Josh Slocum e444d3781c Various TSS improvements from snowblower testing 2021-08-13 10:24:15 -05:00
sfc-gh-tclinkenbeard 904deb9516 Improve DDTeamCollection const-correctness 2021-08-12 18:52:57 -07:00
sfc-gh-tclinkenbeard cfe677c100 storageRecruiter only responds to changes in recruitStorage endpoint 2021-08-12 16:24:03 -07:00
sfc-gh-tclinkenbeard 45ac667271 Add IsPrimary boolean parameter 2021-08-12 14:05:04 -07:00
Andrew Noyes ca9f60baef Fix heap use after free 2021-08-11 15:42:01 -07:00
Xiaoxi Wang 2337ec7d4e remove unused variables 2021-08-03 10:15:34 -07:00
sfc-gh-tclinkenbeard c74047c665 Merge remote-tracking branch 'origin/master' into fix-more-clang-warnings 2021-07-28 11:51:02 -07:00
Steve Atherton 507c1f11e3 Add .log() to bare TraceEvent() invocations without any .detail()s to avoid clang-tidy warning about immediate destruction of object without use. 2021-07-26 19:55:10 -07:00
sfc-gh-tclinkenbeard a27d7c86f4 Fix more -Wreorder-ctor warnings in DataDistribution.actor.cpp 2021-07-24 22:14:43 -07:00
sfc-gh-tclinkenbeard da50e13f3e Fix more -Wreorder-ctor warnings in DataDistribution.actor.cpp, OldTLogServer_4_6.actor.cpp, and Net2.actor.cpp 2021-07-24 17:33:11 -07:00
sfc-gh-tclinkenbeard 6f81155784 Merge remote-tracking branch 'origin/master' into const-serverdbinfo 2021-07-20 10:18:40 -07:00
Steve Atherton f596a81073 Rename ::TRUE and ::FALSE in BooleanParams to ::True and ::False so as to not conflict with the TRUE and FALSE macros provided by the Windows and MacOS SDKs. 2021-07-17 00:11:40 -07:00
Xiaoxi Wang 501dc339a9 relax perpetual wiggle pause condition; add trace log; correct perpetual wiggle priority setting 2021-07-12 05:46:55 +00:00
sfc-gh-tclinkenbeard 8a212862f0 Prevent dataDistributor from modifying ServerDBInfo object 2021-07-11 22:04:54 -07:00
sfc-gh-tclinkenbeard 79ff07a071 Added *BOOLEAN_PARAM macros to enforce documentation of boolean parameters 2021-07-02 15:04:42 -07:00
Neethu Haneesha Bingi 73752f441b exclude locality:clang-format, ranged loops, documentation, tracking addStoragesever for exclusion. 2021-06-23 18:03:27 -07:00
Neethu Haneesha Bingi 62355571d0 exclude servers based on locality match 2021-06-23 18:03:27 -07:00
Xiaoxi Wang 7b713f7fd2 add knob 2021-06-23 05:49:55 +00:00
Xiaoxi Wang f2daf20927 TEST condition 2021-06-21 06:56:03 +00:00
Xiaoxi Wang 0493d149e6 wait remove 2021-06-21 05:18:42 +00:00
Xiaoxi Wang 783520ce85 add and remove some healthy check to solve cluster status oscillation when #ss is little; simplify some code 2021-06-19 16:57:04 +00:00
Xiaoxi Wang 647138145d adjust default value of stopWiggleSignal; better trace logic 2021-06-17 20:59:47 +00:00
Xiaoxi Wang fdd9c30794 code refactor;change stopSignal; 2021-06-16 05:30:58 +00:00
Xiaoxi Wang d33e43fd2b code format 2021-06-14 23:00:02 +00:00
Xiaoxi Wang 2cd4e6d62f check healthy team count, dd queue and disk space;
code refactor
2021-06-14 22:09:45 +00:00
Xiaoxi Wang d46fccc30f Revert "Revert "Properly set simulation test for perpetual storage wiggle and bug fixing""
This reverts commit ad576e8c20.
2021-06-11 22:58:05 +00:00
Xiaoxi Wang ad576e8c20 Revert "Properly set simulation test for perpetual storage wiggle and bug fixing" 2021-06-11 09:07:45 -07:00
Xiaoxi Wang 17ac91bac4
Merge pull request #4929 from sfc-gh-xwang/ppwtest
Properly set simulation test for perpetual storage wiggle and bug fixing
2021-06-10 14:09:50 -07:00
Xiaoxi Wang cd58c0c149 add useful trace; add invalid wiggling server check 2021-06-10 06:50:44 +00:00
Xiaoxi Wang 4220a548ce use the same health check as exclude to avoid 'best team get stuck' 2021-06-09 22:51:46 +00:00
Xiaoxi Wang 51b4cb89c2 fix server_status bug 2021-06-08 23:47:59 +00:00
Xiaoxi Wang 45ebdb1a9d fix perpetual wiggle bug caused by multiple DCs and removeStorageServer 2021-06-08 23:33:25 +00:00
Xiaoxi Wang 6ab0ea3d0f properly set perpetual_storage_wiggle value during tests 2021-06-07 17:55:20 +00:00
sfc-gh-tclinkenbeard 371a38e6e5 Merge remote-tracking branch 'origin/master' into remove-extra-copies 2021-06-07 10:26:06 -07:00
Xiaoxi Wang 838d847d4e
Merge pull request #4860 from sfc-gh-xwang/ppwtest
implement perpetual storage wiggling feature
2021-06-04 16:18:39 -07:00
Xiaoxi Wang 5be65fab5e add comment 2021-06-04 18:40:18 +00:00
Xiaoxi Wang e0981d6732 add code coverage mark 2021-06-03 19:58:28 +00:00
Xiaoxi Wang 351325b3af comment modification; wait perpetual wiggling close 2021-06-03 05:13:20 +00:00
Xiaoxi Wang 21e175b16c add comments for new actors 2021-06-02 18:49:01 +00:00
Xiaoxi Wang 944c9ad8d9 fix memory bug 2021-06-02 17:53:44 +00:00
Josh Slocum b3e4f182ef TSS Mapping Change 2021-06-02 17:30:09 +00:00
Xiaoxi Wang 9684d78a6e solve recruiting conflict with TSS 2021-06-02 06:12:45 +00:00
Xiaoxi Wang 8b9c8b33fc manually merge with master 2021-06-01 17:51:42 +00:00
Xiaoxi Wang ce308edc5e fix wiggler logic bug 2021-05-26 21:57:58 +00:00
Josh Slocum 4257ac2b4d More TSS Changes/Fixes 2021-05-25 20:37:48 +00:00
Josh Slocum ce82c9653e Testing Storage Server implementation 2021-05-25 20:28:50 +00:00
Xiaoxi Wang e9a23840ea fix promise bug 2021-05-25 20:25:21 +00:00
Xiaoxi Wang f11b7ffa5f merge master, fix promise callback bug 2021-05-25 18:43:08 +00:00
Xiaoxi Wang 7bc55448aa fix iterator bug 2021-05-24 19:11:28 +00:00
Xiaoxi Wang 85cd2b9945 add perpetualStorageWiggler 2021-05-20 23:31:08 +00:00
Xiaoxi Wang 3f3a81b3d9 add pid2server_info to maintain Process id set 2021-05-20 03:32:15 +00:00
sfc-gh-tclinkenbeard f28ac955c3 Remove unnecessary temporary objects while growing objects of type std::vector<std::pair<A, B>> 2021-05-10 16:32:50 -07:00
sfc-gh-tclinkenbeard 5c2d7b6080 Create RangeResult type alias 2021-05-03 13:14:16 -07:00
Trevor Clinkenbeard 0db28f6ea0
Merge pull request #4535 from jzhou77/fix-dd
Fix DD Assertion failed in canBeSet
2021-03-24 10:50:04 -07:00
Jingyu Zhou 0c3bc09524 Remove the shuttingDown flag 2021-03-21 20:12:37 -07:00
Jingyu Zhou cb26576b95 Fix DD assertion failure
This fixes #4493, where DDTeamCollection::~DDTeamCollection creates new teams
that hold a pointer to the DDTeamCollection, which later causes an assertion
failure because the memory is invalid.

The fix is to cancel teamBuilder at the beginning of ~DDTeamCollection.
2021-03-21 19:54:44 -07:00