Xiaoxi Wang
13bbd062c4
add storage wiggler state
2022-05-08 22:06:11 -07:00
Xiaoxi Wang
a3d0b005dc
reset several method use getShardMetrics
2022-05-04 00:00:03 -07:00
Xiaoxi Wang
1723bee639
add fetchTopKShardMetrics to dd tracker
2022-05-03 23:42:09 -07:00
Xiaoxi Wang
940512f208
simplify the storageWiggler ordering; add unittests
2022-05-02 12:25:02 -07:00
Xiaoxi Wang
99d2335220
add storeType to metadata; updateStorageMetadata; combine storeTypeTracker
2022-04-23 00:03:57 -07:00
Xiaoxi Wang
ed97a35dc0
Merge branch 'main' into readaware
2022-04-12 16:47:15 -07:00
sfc-gh-tclinkenbeard
41c3bb03c3
Fix storageFaultTolerance calculation
2022-04-08 11:35:24 -07:00
sfc-gh-tclinkenbeard
3fcaf4dda3
Account for storage team size when computing storageFaultTolerance
2022-04-08 10:39:44 -07:00
sfc-gh-tclinkenbeard
e27b0d9ab5
Merge remote-tracking branch 'origin/main' into improve-snapshot-fault-tolerance
2022-04-07 23:30:16 -07:00
sfc-gh-tclinkenbeard
f4a988fe36
Remove unnecessary call to getDatabaseConfiguration in ddSnapCreateCore
2022-04-07 23:27:59 -07:00
sfc-gh-tclinkenbeard
91930b8040
Remove getMinReplicasRemaining PromiseStream.
...
Instead, in order to enforce the maximum fault tolerance for snapshots,
update getStorageWorkers to return the number of unavailable storage
servers (instead of throwing an error when unavailable storage servers
exist).
2022-04-07 23:23:23 -07:00
Xiaoxi Wang
aba9d85560
merge main
2022-04-06 09:57:52 -07:00
sfc-gh-tclinkenbeard
70f378bacc
Restrict write access to getUnhealthyRelocationCount
2022-04-03 23:47:54 -07:00
sfc-gh-tclinkenbeard
33fb6ab983
Prevent coordFaultTolerance from dropping below 0
2022-04-03 23:37:42 -07:00
sfc-gh-tclinkenbeard
4f61c86b69
Add MAX_COORDINATOR_SNAPSHOT_FAULT_TOLERANCE knob
2022-04-03 23:28:57 -07:00
sfc-gh-tclinkenbeard
253db642be
Add MAX_SNAPSHOT_FAULT_TOLERANCE knob
2022-04-03 22:31:45 -07:00
Steve Atherton
6744e9e4f9
Change timestamps used in storage server metadata and perpetual wiggle metrics to epoch seconds, stored as doubles, and stringified as either floating point epoch seconds or timestamp strings of the form "2013-04-28 20:57:01.000 +0000".
2022-03-30 18:57:06 -07:00
Steve Atherton
d6e2d2a1fe
Fix nondeterminism in StorageWiggleMetrics caused by use of timer_int().
2022-03-30 14:48:01 -07:00
Xiaoxi Wang
d93b57dd88
conflict solving
2022-03-24 20:45:51 -07:00
Xiaoxi Wang
1b631a9263
solve conflict with main
2022-03-24 16:29:11 -07:00
sfc-gh-tclinkenbeard
a71099471b
Update copyright header dates
2022-03-21 13:36:23 -07:00
Xiaoxi Wang
4b92f8f546
add relocate reason and set teamSorter in relocator
2022-03-18 16:39:31 -07:00
sfc-gh-tclinkenbeard
58de6e22cc
Add BalanceOnRequests boolean parameter for ModelInterface
2022-03-16 14:25:32 -07:00
Xiaoge Su
cbd381778e
Fix the includes in DataDistribution.actor.cpp
...
Update the comment to re-trigger failed checks
2022-03-15 18:05:55 -07:00
A.J. Beamon
250a88e682
Enforce that trace event suppression calls happen first when using trace event call chaining. Fix various instances where we weren't following this requirement.
2022-02-24 12:25:52 -08:00
Bharadwaj V.R
36c5d3a1e6
Merge branch 'main' into dd-utest
2022-02-11 12:25:31 -08:00
sfc-gh-tclinkenbeard
9158564bfc
Fix formatting
2022-02-11 10:27:41 -08:00
Bharadwaj V.R
41bd39a82a
Fix code formatting
2022-02-10 22:10:50 -08:00
Bharadwaj V.R
b306288c62
Merge branch 'main' into dd-utest
2022-02-10 22:00:52 -08:00
sfc-gh-tclinkenbeard
2165635478
Make printSnapshotTeamsInfo a static function of DDTeamCollection
2022-02-10 18:45:52 -08:00
sfc-gh-tclinkenbeard
9bc38ae73e
Make DDTeamCollection::distributorId private
2022-02-10 18:26:06 -08:00
sfc-gh-tclinkenbeard
14c8483e9d
Mark DDTeamCollection::primary private
2022-02-10 18:16:57 -08:00
sfc-gh-tclinkenbeard
641a38bd0b
Make more DDTeamCollection methods private.
...
The methods only used by DDTeamCollection::run can now be made private.
2022-02-10 16:19:32 -08:00
sfc-gh-tclinkenbeard
c4508330d2
Make dataDistributionTeamCollection a static function of DDTeamCollection
2022-02-10 16:19:32 -08:00
sfc-gh-tclinkenbeard
5477012ad8
Change DDTeamCollection method signatures to accept references.
...
Passing nullptr to these methods is invalid, but previously the
signature didn't indicate this. We previously needed to pass pointers
due to actor compiler restrictions, but these restrictions no longer
apply.
2022-02-10 16:19:32 -08:00
sfc-gh-tclinkenbeard
3141698c41
Use special ASSERT_* macros for numeric comparison in data distribution code.
...
This helps debugging by printing the exact input values when an
assertion fails.
2022-02-10 11:59:19 -08:00
sfc-gh-tclinkenbeard
975b9f3b32
Remove get helper function from DataDistribution.actor.cpp
2022-02-10 11:32:33 -08:00
Bharadwaj V.R
6d46b03651
Add some unit tests for DD team selection
2022-02-09 22:22:56 -08:00
sfc-gh-tclinkenbeard
04a1347df2
Merge remote-tracking branch 'origin/main' into dd-refactor
2022-02-08 00:33:27 -08:00
Xiaoxi Wang
6dc5921575
createdTime based storage wiggler ( #6219 )
...
* add storagemetadata
* add StorageWiggler;
* fix serverMetadataKey bug
* add metadata tracker in storage tracker
* finish StorageWiggler
* update next storage ID
* change pid to server id
* write metadata when seed SS
* add status json fields
* remove pid based ppw iteration
* fix time expression
* fix tss metadata nonexistence; fix transaction retry when retrieving metadata
* fix checkMetadata bug when store type is wrong
* fix remove storage status json
* format code
* refactor updateNextWigglingStoragePID
* separate storage metadata tracker and store type tracker
* rename pid
* wiggler stats
* fix completion between waitServerListChange and storageRecruiter
* solve review comments
* rename system key
* fix database lock timeout by adding lock_aware
* format code
* status json
* resolve code format/naming comments
* delete expireNow; change PerpetualStorageWiggleID's value to KeyBackedObjectMap<UID, StorageWiggleValue>
* fix omit start round
* format code
* status json reset
* solve status json format
* improve status json latency; replace binarywriter/reader to objectwriter/reader; refactor storagewigglerstats transactions
* status timestamp
2022-02-04 15:04:30 -08:00
sfc-gh-tclinkenbeard
68ec591cf9
Move DDTeamCollection into its own files
2022-02-04 00:39:42 -08:00
Ata E Husain Bohra
703364d146
Update cluster recovery documentation ( #6255 )
...
Patch updates code documentation to reflect the recent code
refactoring where ClusterController process drives recovery
instead of sequencer/master process.
2022-01-18 13:54:00 -08:00
sfc-gh-tclinkenbeard
90ced244eb
Fix -Wunused-but-set-variable warnings
2021-12-01 18:15:53 -08:00
Josh Slocum
1870e07ff4
Fixed pause racing with waitUntilHealthy
2021-11-29 14:19:15 -06:00
Evan Tschannen
964d0209ca
Merge pull request #5637 from sfc-gh-ljoswiak/features/data-loss-prevention
...
Data loss protection when joining new cluster
2021-11-15 15:26:32 -08:00
Ata E Husain Bohra
82c3e8bf79
Trigger buildTeam operation if server transitions from unhealthy -> healthy ( #5930 )
...
* Trigger buildTeam operation if server transitions from unhealthy -> healthy
The DataDistribution actor builds teams as the server count changes
(addition/removal). However, the total_healthy_server count may be
insufficient to allow team formation. If that happens, the buildTeam
operation is not triggered even when the healthy server count recovers.
This patch triggers the `checkBuildTeam` operation when a server
transitions from the unhealthy to the healthy state. In case the system has
already created enough teams (desiredTeamCount/maxTeamCount), the operation
incurs minimal cost.
2021-11-12 09:41:01 -08:00
Lukas Joswiak
15e0d5b29f
Add explicit transaction options when reading cluster ID
2021-11-09 12:29:49 -08:00
Lukas Joswiak
3988b11fd6
Cleanup
2021-11-09 12:29:48 -08:00
Lukas Joswiak
30867750b5
Add protection against storage and tlog data deletion when joining a new cluster
2021-11-09 12:29:47 -08:00
sfc-gh-tclinkenbeard
30cef51746
Improve tracing in ddSnapCreateCore
2021-11-04 12:59:50 -07:00
sfc-gh-tclinkenbeard
d0c9cf4fb0
Enable mismatched-tags clang warning
2021-11-01 14:18:31 -07:00
Xiaoxi Wang
e4fd0023b7
don't disable machine team remover
2021-10-27 09:08:37 -07:00
Xiaoxi Wang
75ef854563
format
2021-10-27 09:08:37 -07:00
Xiaoxi Wang
db7ee9d389
disable team remover
2021-10-27 09:08:37 -07:00
Xiaoxi Wang
14fa32f208
change boolean
2021-10-27 09:08:37 -07:00
Xiaoxi Wang
1a2a838df3
add knob
2021-10-27 09:08:37 -07:00
Xiaoxi Wang
c320391c4c
restartRecruiting
2021-10-27 09:08:37 -07:00
Xiaoxi Wang
dc630d63bd
add asyncvar
2021-10-27 09:08:37 -07:00
Xiaoxi Wang
654c0a1f14
format
2021-10-27 09:08:37 -07:00
Xiaoxi Wang
8a10966126
wait extra time
2021-10-27 09:08:37 -07:00
Xiaoxi Wang
d1959122af
consider wiggling when waitUntilHealthy
2021-10-27 09:08:37 -07:00
Xiaoxi Wang
69190ed04e
format
2021-10-27 09:08:37 -07:00
Xiaoxi Wang
0053b4793e
change knob and delete redundant doBuildTeam
2021-10-27 09:08:37 -07:00
Xiaoxi Wang
db7b48b71c
wiggling teams calculation replace
2021-10-27 09:08:37 -07:00
Xiaoxi Wang
3a6359e202
minus wiggling teams when build team
2021-10-27 09:08:37 -07:00
He Liu
16ae2b76e5
Merge branch 'master' of https://github.com/apple/foundationdb into clean-sim-test-data-loss
2021-10-21 09:16:53 -07:00
Trevor Clinkenbeard
504d0b71b2
Fix invalid memory access when dataDistribution actor is cancelled ( #5791 )
...
* Fix valgrind error when dataDistribution actor is cancelled
* Trace Sev30 when dataDistribution actor is cancelled outside of simulation
* Rethrow actor_cancelled error in dataDistribution catch block
2021-10-18 14:21:29 -07:00
He Liu
dbfeb06c97
Reproduced user data loss incident, and tested that the improved exclude tool can fix the system metadata.
2021-10-14 14:08:39 -07:00
Steve Atherton
f339b603a5
Bug fix: printSnapshotTeamsInfo() could crash when looking up status for a storage server that was very recently added because its entry in server_status was not yet created.
...
Bug fix: printSnapshotTeamsInfo()'s local server_status map would not see status updates for server UIDs that already existed in the map.
2021-10-10 01:48:31 -07:00
Neethu Haneesha Bingi
3ea7209013
Simulation changes to randomly wiggle with locality filter and review comments.
2021-09-30 10:00:33 -07:00
Neethu Haneesha Bingi
3e79299898
Locality filter support to perpetual storage wiggler feature.
2021-09-30 10:00:33 -07:00
Chang Liu
f50a3f08de
Fix format problem for file fdbserver/DataDistribution.actor.cpp.
2021-09-24 15:40:17 -07:00
Chang Liu
8761960cdc
fix roll trace event issue for data distribution(master)
2021-09-24 15:40:17 -07:00
Chang Liu
731c1fffac
fix roll trace event issue for data distribution(master)
2021-09-24 15:40:17 -07:00
Chang Liu
814e3c729f
fix roll trace event issue for data distribution(master)
2021-09-24 15:40:17 -07:00
Chang Liu
c10dd0df4b
fix roll trace event issue for data distribution
2021-09-24 15:40:17 -07:00
He Liu
a1f6dcc2d5
merge from master
2021-09-22 13:07:38 -07:00
Xiaoxi Wang
1730d75f73
change configure test
...
add store type check
add test file
2021-09-21 18:11:04 -07:00
He Liu
2246a0bee7
switch to plain random selection
2021-09-20 11:23:54 -07:00
He Liu
4d5bf08da8
address comments
2021-09-19 17:03:06 -07:00
Xiaoge Su
abf73047ca
Enforce std:: specifier rather than using namespace
2021-09-16 19:40:28 -07:00
He Liu
ef7fdc0781
fmt
2021-09-15 10:32:09 -07:00
He Liu
c8a3413820
exclude to-be-dropped server from the random team
2021-09-15 09:07:50 -07:00
helium
fd6d088945
choose team before removing server
2021-09-14 19:24:59 -07:00
helium
7e53f8662d
added comments
2021-09-13 13:28:55 -07:00
helium
8e0b572a18
Added comments.
2021-09-02 22:29:07 -07:00
helium
6612cc00b6
Check if the src server list will be empty before removing a failed server.
2021-09-01 14:52:07 -07:00
Trevor Clinkenbeard
66df75c570
Merge pull request #5385 from sfc-gh-tclinkenbeard/debug-dd
...
Capture deep copy of `machine_info` in `printSnapshotTeamsInfo`
2021-08-20 13:25:50 -07:00
sfc-gh-tclinkenbeard
9458a6975d
Remove std::map::at usage from DataDistribution.actor.cpp
2021-08-20 12:35:26 -07:00
sfc-gh-tclinkenbeard
556e4bc283
Add assertion to overlappingMachineMembers
2021-08-13 14:56:22 -07:00
sfc-gh-tclinkenbeard
a0a4207ce2
Capture deep copy of machine_info in printSnapshotTeamsInfo
2021-08-13 12:13:54 -07:00
sfc-gh-tclinkenbeard
3f0e07d79c
Remove dead code
2021-08-13 09:54:02 -07:00
sfc-gh-tclinkenbeard
ea4f6850da
Mark ServerStatus::excludeOnRecruit const
2021-08-13 09:52:07 -07:00
sfc-gh-tclinkenbeard
0aafd9d5f0
Mark TCServerInfo::isCorrectStoreType const
2021-08-13 09:49:22 -07:00
Josh Slocum
e444d3781c
Various TSS improvements from snowblower testing
2021-08-13 10:24:15 -05:00
sfc-gh-tclinkenbeard
904deb9516
Improve DDTeamCollection const-correctness
2021-08-12 18:52:57 -07:00
sfc-gh-tclinkenbeard
cfe677c100
storageRecruiter only responds to changes in recruitStorage endpoint
2021-08-12 16:24:03 -07:00
sfc-gh-tclinkenbeard
45ac667271
Add IsPrimary boolean parameter
2021-08-12 14:05:04 -07:00
Andrew Noyes
ca9f60baef
Fix heap use after free
2021-08-11 15:42:01 -07:00
Xiaoxi Wang
2337ec7d4e
remove unused variables
2021-08-03 10:15:34 -07:00
sfc-gh-tclinkenbeard
c74047c665
Merge remote-tracking branch 'origin/master' into fix-more-clang-warnings
2021-07-28 11:51:02 -07:00
Steve Atherton
507c1f11e3
Add .log() to bare TraceEvent() invocations without any .detail()s to avoid clang-tidy warning about immediate destruction of object without use.
2021-07-26 19:55:10 -07:00
sfc-gh-tclinkenbeard
a27d7c86f4
Fix more -Wreorder-ctor warnings in DataDistribution.actor.cpp
2021-07-24 22:14:43 -07:00
sfc-gh-tclinkenbeard
da50e13f3e
Fix more -Wreorder-ctor warnings in DataDistribution.actor.cpp, OldTLogServer_4_6.actor.cpp, and Net2.actor.cpp
2021-07-24 17:33:11 -07:00
sfc-gh-tclinkenbeard
6f81155784
Merge remote-tracking branch 'origin/master' into const-serverdbinfo
2021-07-20 10:18:40 -07:00
Steve Atherton
f596a81073
Rename ::TRUE and ::FALSE in BooleanParams to ::True and ::False so as to not conflict with the TRUE and FALSE macros provided by the Windows and MacOS SDKs.
2021-07-17 00:11:40 -07:00
Xiaoxi Wang
501dc339a9
relax perpetual wiggle pause condition; add trace log; correct perpetual wiggle priority setting
2021-07-12 05:46:55 +00:00
sfc-gh-tclinkenbeard
8a212862f0
Prevent dataDistributor from modifying ServerDBInfo object
2021-07-11 22:04:54 -07:00
sfc-gh-tclinkenbeard
79ff07a071
Added *BOOLEAN_PARAM macros to enforce documentation of boolean parameters
2021-07-02 15:04:42 -07:00
Neethu Haneesha Bingi
73752f441b
exclude locality:clang-format, ranged loops, documentation, tracking addStoragesever for exclusion.
2021-06-23 18:03:27 -07:00
Neethu Haneesha Bingi
62355571d0
exclude servers based on locality match
2021-06-23 18:03:27 -07:00
Xiaoxi Wang
7b713f7fd2
add knob
2021-06-23 05:49:55 +00:00
Xiaoxi Wang
f2daf20927
TEST condition
2021-06-21 06:56:03 +00:00
Xiaoxi Wang
0493d149e6
wait remove
2021-06-21 05:18:42 +00:00
Xiaoxi Wang
783520ce85
add and remove some health checks to fix cluster status oscillation when the number of storage servers is small; simplify some code
2021-06-19 16:57:04 +00:00
Xiaoxi Wang
647138145d
adjust default value of stopWiggleSignal; better trace logic
2021-06-17 20:59:47 +00:00
Xiaoxi Wang
fdd9c30794
code refactor;change stopSignal;
2021-06-16 05:30:58 +00:00
Xiaoxi Wang
d33e43fd2b
code format
2021-06-14 23:00:02 +00:00
Xiaoxi Wang
2cd4e6d62f
check healthy team count, dd queue and disk space;
...
code refactor
2021-06-14 22:09:45 +00:00
Xiaoxi Wang
d46fccc30f
Revert "Revert "Properly set simulation test for perpetual storage wiggle and bug fixing""
...
This reverts commit ad576e8c20.
2021-06-11 22:58:05 +00:00
Xiaoxi Wang
ad576e8c20
Revert "Properly set simulation test for perpetual storage wiggle and bug fixing"
2021-06-11 09:07:45 -07:00
Xiaoxi Wang
17ac91bac4
Merge pull request #4929 from sfc-gh-xwang/ppwtest
...
Properly set simulation test for perpetual storage wiggle and bug fixing
2021-06-10 14:09:50 -07:00
Xiaoxi Wang
cd58c0c149
add useful trace; add invalid wiggling server check
2021-06-10 06:50:44 +00:00
Xiaoxi Wang
4220a548ce
use the same health check as exclude to avoid 'best team get stuck'
2021-06-09 22:51:46 +00:00
Xiaoxi Wang
51b4cb89c2
fix server_status bug
2021-06-08 23:47:59 +00:00
Xiaoxi Wang
45ebdb1a9d
fix perpetual wiggle bug caused by multiple DCs and removeStorageServer
2021-06-08 23:33:25 +00:00
Xiaoxi Wang
6ab0ea3d0f
properly set perpetual_storage_wiggle value during tests
2021-06-07 17:55:20 +00:00
sfc-gh-tclinkenbeard
371a38e6e5
Merge remote-tracking branch 'origin/master' into remove-extra-copies
2021-06-07 10:26:06 -07:00
Xiaoxi Wang
838d847d4e
Merge pull request #4860 from sfc-gh-xwang/ppwtest
...
implement perpetual storage wiggling feature
2021-06-04 16:18:39 -07:00
Xiaoxi Wang
5be65fab5e
add comment
2021-06-04 18:40:18 +00:00
Xiaoxi Wang
e0981d6732
add code coverage mark
2021-06-03 19:58:28 +00:00
Xiaoxi Wang
351325b3af
comment modification; wait perpetual wiggling close
2021-06-03 05:13:20 +00:00
Xiaoxi Wang
21e175b16c
add comments for new actors
2021-06-02 18:49:01 +00:00
Xiaoxi Wang
944c9ad8d9
fix memory bug
2021-06-02 17:53:44 +00:00
Josh Slocum
b3e4f182ef
TSS Mapping Change
2021-06-02 17:30:09 +00:00
Xiaoxi Wang
9684d78a6e
solve recruiting conflict with TSS
2021-06-02 06:12:45 +00:00
Xiaoxi Wang
8b9c8b33fc
manually merge with master
2021-06-01 17:51:42 +00:00
Xiaoxi Wang
ce308edc5e
fix wiggler logic bug
2021-05-26 21:57:58 +00:00
Josh Slocum
4257ac2b4d
More TSS Changes/Fixes
2021-05-25 20:37:48 +00:00
Josh Slocum
ce82c9653e
Testing Storage Server implementation
2021-05-25 20:28:50 +00:00
Xiaoxi Wang
e9a23840ea
fix promise bug
2021-05-25 20:25:21 +00:00
Xiaoxi Wang
f11b7ffa5f
merge master, fix promise callback bug
2021-05-25 18:43:08 +00:00
Xiaoxi Wang
7bc55448aa
fix iterator bug
2021-05-24 19:11:28 +00:00
Xiaoxi Wang
85cd2b9945
add perpetualStorageWiggler
2021-05-20 23:31:08 +00:00
Xiaoxi Wang
3f3a81b3d9
add pid2server_info to maintain Process id set
2021-05-20 03:32:15 +00:00
sfc-gh-tclinkenbeard
f28ac955c3
Remove unnecessary temporary objects while growing objects of type std::vector<std::pair<A, B>>
2021-05-10 16:32:50 -07:00
sfc-gh-tclinkenbeard
5c2d7b6080
Create RangeResult type alias
2021-05-03 13:14:16 -07:00
Trevor Clinkenbeard
0db28f6ea0
Merge pull request #4535 from jzhou77/fix-dd
...
Fix DD Assertion failed in canBeSet
2021-03-24 10:50:04 -07:00
Jingyu Zhou
0c3bc09524
Remove the shuttingDown flag
2021-03-21 20:12:37 -07:00
Jingyu Zhou
cb26576b95
Fix DD assertion failure
...
This fixes #4493, where DDTeamCollection::~DDTeamCollection creates new teams
that hold a pointer to the DDTeamCollection, which later causes an assertion failure
because the memory is invalid.
The fix is to cancel teamBuilder at the beginning of ~DDTeamCollection.
2021-03-21 19:54:44 -07:00