120 Commits

Author SHA1 Message Date
Jingyu Zhou 8b67a89eed More review comments fixed. 2020-01-22 19:42:13 -08:00
Jingyu Zhou 85c4a4e422 Address review comments for PR #1625 2020-01-22 19:38:45 -08:00
Jingyu Zhou 6c6a553dcc Fix hang due to distributor death in QuietDatabase
It's possible that after obtaining data distributor, the distributor then dies
and a new one is recruited. Because the tester is still contacting the old one,
it becomes stuck.
2020-01-22 19:38:45 -08:00
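A minimal sketch of one way to avoid the hang described above: look up the current data distributor again on every attempt instead of holding on to the first one. The names queryCurrentDistributor, lookupDistributorId and query are hypothetical and are not FoundationDB APIs; the actual fix in QuietDatabase may be structured differently.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper: re-resolve the distributor before each attempt so that
// a distributor recruited after a failure is the one that gets contacted.
// lookupDistributorId returns the id of the current distributor.
// query returns false when the contacted distributor is dead, and otherwise
// stores its answer in the output parameter.
bool queryCurrentDistributor(const std::function<int()>& lookupDistributorId,
                             const std::function<bool(int, int&)>& query,
                             int maxAttempts,
                             int& result) {
    assert(maxAttempts > 0);
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        int distributorId = lookupDistributorId(); // refreshed every attempt
        if (query(distributorId, result)) {
            return true;
        }
        // The distributor we contacted is gone; loop and look it up again.
    }
    return false;
}
```

The design point is only that the lookup sits inside the loop; a caller that caches the first distributor id outside the loop would keep contacting the dead one.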
Jon Fu 471e283128 Merge branch 'master' of https://github.com/apple/foundationdb into mark-ss-failed 2019-09-18 11:49:07 -07:00
Evan Tschannen 8fbd90e2f6
Merge pull request #1985 from xumengpanda/mengxu/storage-engine-switch-PR-v2
Graceful storage engine migration
2019-09-09 13:51:53 -07:00
Meng Xu 879dec1a5d ConsistencyCheck:Check teamCollectionValid for data_hall mode 2019-09-05 10:34:57 -07:00
Jon Fu 00c2025d4b fixed removeKeys impl, adjusted test workload, and introduced extra safety checks to NativeAPI and proxy 2019-08-27 14:39:44 -07:00
Meng Xu a588710376 StorageEngineSwitch:Graceful switch
When fdbcli changes the storeType for storage engines,
we switch the store type of storage servers gracefully, one by one.
This avoids recruiting multiple storage servers on the same process,
which can cause an OOM error.
2019-08-12 17:37:52 -07:00
Evan Tschannen 9f11f2ec53 Merge branch 'master' of github.com:apple/foundationdb 2019-07-30 16:55:56 -07:00
Evan Tschannen 2d7ec54d3e fix: some exclude workloads would cause both the primary and remote datacenter to be considered dead 2019-07-30 16:35:52 -07:00
sramamoorthy 5a56f6b456 minor snap create client improvement and bug fixes 2019-07-29 20:28:22 -07:00
Balachandar Namasivayam bf87d906f6 Fix a crash. 2019-07-25 16:15:28 -07:00
sramamoorthy 31a1e6858b remove unnecessary state variables in getCoord 2019-07-24 15:36:28 -07:00
sramamoorthy a65c9f92ed get rid of all timeouts and other changes 2019-07-24 15:36:28 -07:00
sramamoorthy a2f2ad96ff code review comments and merge to master changes 2019-07-24 15:36:28 -07:00
sramamoorthy d90b678f6f storage worker to throw in case of failures 2019-07-24 15:36:28 -07:00
sramamoorthy 7ec8fe6e74 snap v2: implement get only local storage workers 2019-07-24 15:36:28 -07:00
sramamoorthy 8f1f0c0435 snap v2: worker and other helper related changes 2019-07-24 15:36:28 -07:00
Meng Xu 64bee63dbc Resolve two review comments
1) No need to check server with only one team when teamRemover finds
a server team or machine team to remove

2) Fix optimalTeamCount counting in teamTracker
2019-07-18 18:46:31 -07:00
Meng Xu 80ed39c189 QuietDB:Disable check for too many teams
Because the team remover does not remove a team if doing so would leave a server
with 0 teams, we currently disable the check until we have a better strategy to
enforce the desired number of teams.

This will not cause much of a problem in real deployments, whereas having 0 teams on a server
makes the server unable to host data, which is bad.
2019-07-16 12:38:55 -07:00
Meng Xu 20f067e794 Merge with master:Resolve conflict with PR#1797 2019-07-16 10:52:28 -07:00
Meng Xu 243504b125 DD:Clang format changes 2019-07-15 18:40:14 -07:00
Meng Xu 94e9b8a3b4 Do not remove a team whose min team number is less than target
If the minimum number of teams of servers in a team is less than the
target value (desired_team_number_per_server * (teamSize + 1) / 2),
the team remover should not remove it. Otherwise, DD will oscillate in
building more teams and removing redundant teams.

Do not do consistency check for three_data_hall mode because when
machines are not evenly distributed across data halls, we will
need to build more teams than the total desired number to make sure
the number of teams per server is no less than the target value.
2019-07-15 18:30:13 -07:00
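The removal rule in the commit above can be sketched as follows. The target formula is taken from the message; the function names, the use of integer arithmetic, and the representation of a team as a list of per-member team counts are assumptions of this sketch, not the DataDistribution code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Target from the commit message:
// desired_team_number_per_server * (teamSize + 1) / 2.
// Integer division is an assumption of this sketch.
int targetTeamsPerServer(int desiredTeamsPerServer, int teamSize) {
    assert(desiredTeamsPerServer >= 0 && teamSize > 0);
    return desiredTeamsPerServer * (teamSize + 1) / 2;
}

// teamCountsOfMembers[i] is the number of teams that the i-th member server
// of the candidate team currently belongs to (hypothetical representation).
// The team may be removed only if its least-covered member is not below target.
bool mayRemoveTeam(const std::vector<int>& teamCountsOfMembers,
                   int desiredTeamsPerServer,
                   int teamSize) {
    assert(!teamCountsOfMembers.empty());
    int minTeams = *std::min_element(teamCountsOfMembers.begin(),
                                     teamCountsOfMembers.end());
    return minTeams >= targetTeamsPerServer(desiredTeamsPerServer, teamSize);
}
```

For example, with 5 desired teams per server and a team size of 3, the target is 5 * 4 / 2 = 10, so a team containing a server that is on only 9 teams is kept.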
Meng Xu cafe9b9412 TC:Target team num per server is desired number
Do not overbuild teams because we may oscillate between building more teams and
removing the redundant teams. The oscillation happens when the machines are not
evenly distributed across availability zones.
For example, in three_data_hall mode, two data halls have 1 machine each and
the 3rd data hall has 3 machines. To build enough (or more) teams for servers
in the 3rd data hall, we will overbuild teams. However,
the teamRemover will remove those newly built teams.
2019-07-15 17:32:51 -07:00
Meng Xu cf935ff9e6 Remove debug message and format code 2019-07-11 22:05:20 -07:00
Meng Xu cd28a0b604 Reenable check each server must have at least 1 team 2019-07-11 17:58:14 -07:00
Meng Xu 221e6945db TeamTracker:Fix bug in counting optimalTeamCount
When a teamTracker is cancelled, e.g., by the redundant teamRemover or badTeamRemover,
we should decrease the optimalTeamCount if the team is considered as an
optimal team, i.e., all members' machine fitness is no worse than unset, and
the team is healthy.
2019-07-11 17:22:41 -07:00
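The counting fix described above can be illustrated with a small sketch in which a tracker remembers whether it was counted as optimal and undoes that contribution when it is cancelled. TeamCollectionCounters and TeamTrackerSketch are hypothetical names; the real teamTracker is a Flow actor and is structured differently.

```cpp
#include <cassert>

// Hypothetical stand-in for the shared counter kept by the team collection.
struct TeamCollectionCounters {
    int optimalTeamCount = 0;
};

// Hypothetical tracker: counts its team as optimal when the team is healthy
// and every member's machine fitness is no worse than unset.
class TeamTrackerSketch {
public:
    TeamTrackerSketch(TeamCollectionCounters& counters,
                      bool healthy,
                      bool fitnessNoWorseThanUnset)
      : counters_(counters), countedAsOptimal_(healthy && fitnessNoWorseThanUnset) {
        if (countedAsOptimal_) {
            ++counters_.optimalTeamCount;
        }
    }

    TeamTrackerSketch(const TeamTrackerSketch&) = delete;
    TeamTrackerSketch& operator=(const TeamTrackerSketch&) = delete;

    // Called when a team remover cancels this tracker. Idempotent, so the
    // count is decremented at most once.
    void cancel() {
        if (countedAsOptimal_) {
            --counters_.optimalTeamCount;
            countedAsOptimal_ = false;
        }
    }

    ~TeamTrackerSketch() { cancel(); }

private:
    TeamCollectionCounters& counters_;
    bool countedAsOptimal_;
};
```

Keeping the "was counted" flag next to the decrement is one way to make sure a cancelled non-optimal team never lowers the count.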
Meng Xu 4c32593f59 QuietDB:Do not check when machineId is not zoneID 2019-07-11 10:37:16 -07:00
Meng Xu 4fae510633 AddBestMachineTeams:BugFix:Must build team when it has remainingMachineTeamBudget 2019-07-10 11:55:06 -07:00
Meng Xu 9816fb6aca ConsistencyCheck:Check minServerTeamOnServer larger than 0 2019-07-10 11:53:47 -07:00
Meng Xu 522230f050 ConsistencyCheck:getTeamCollectionValid tries 10 times before return false
Because serverTeamRemover takes time to remove teams,
getTeamCollectionValid() needs to wait for a while before concluding that
the number of server teams is larger than the desired number.
2019-07-09 11:46:57 -07:00
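A minimal sketch of the bounded retry described above, assuming a synchronous check and a caller-supplied wait step. checkWithRetries and its parameters are hypothetical; the real getTeamCollectionValid is a Flow actor that waits asynchronously between attempts.

```cpp
#include <cassert>
#include <functional>

// Runs check up to maxAttempts times and returns true on the first success.
// waitBetweenAttempts is called between failed attempts so that slow
// background work (such as team removal) has time to finish.
bool checkWithRetries(const std::function<bool()>& check,
                      int maxAttempts,
                      const std::function<void()>& waitBetweenAttempts) {
    assert(maxAttempts > 0);
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        if (check()) {
            return true;
        }
        if (attempt < maxAttempts) {
            waitBetweenAttempts(); // no wait after the final failure
        }
    }
    return false;
}
```

Calling it with maxAttempts set to 10 corresponds to the "tries 10 times before return false" behavior in the commit title.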
Meng Xu cf03b274a2 TeamTracker:Add traceTeamCollectionInfo 2019-07-08 23:01:25 -07:00
Meng Xu 08d76a7bbe ServerTeamRemover:Bug fix and clang-format 2019-07-08 17:08:32 -07:00
Meng Xu 9cc11e88c5 TeamBuilder:Reduce unnecessary calculation of remainingTeamBudget 2019-07-08 16:56:06 -07:00
Meng Xu 08a721b320 Merge branch 'master' into mengxu/server-team-remover-PR 2019-07-08 16:30:32 -07:00
A.J. Beamon 0a5c7608df Remove "Number" suffix from newly added events (and variables that feed the events). 2019-07-08 15:45:28 -07:00
A.J. Beamon f52c239ef8 Merge branch 'master' into trace-event-rename
# Conflicts:
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/QuietDatabase.actor.cpp
2019-07-08 15:37:00 -07:00
A.J. Beamon 2a56e011ea Merge branch 'release-6.1' into merge-release-6.1-into-master
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbserver/DataDistribution.actor.cpp
2019-07-05 13:52:29 -07:00
A.J. Beamon 2a709ee5d0 Rename event details that use the suffix "Number" to indicate a count, as number could also imply an index. Rename a few other trace events and details that e.g. needed to be pluralized. 2019-07-05 08:54:21 -07:00
Meng Xu 599fcb2e6d Add serverTeamRemover to remove redundant server teams 2019-07-02 17:40:37 -07:00
Meng Xu 716494ed9f ConsistencyCheck:Check serverTeamNumber larger than desired number 2019-07-02 17:40:37 -07:00
Meng Xu 875cb877ac TeamCollection: Apply clang-format 2019-06-28 16:01:05 -07:00
Meng Xu 0baae134f6 TeamCollectionInfo: Resolve review comments 2019-06-28 15:59:47 -07:00
Meng Xu 4da345f7d2 TeamCollectionTest:Remove test on minTeamOnServer 2019-06-27 19:05:10 -07:00
Meng Xu f889843332 Change traceTeamCollectionInfo to actor
There are cases where traceTeamCollectionInfo is called twice within the same execution block, i.e.,
with no wait between the two traceTeamCollectionInfo calls.
Because simulation uses the same time for all instructions in the same execution block,
having more than one traceTeamCollectionInfo event at the same time breaks the trackLatest semantics.
When the simulator always chooses one of them, the simulation test reports a false-positive error.

Changing this function to an actor and adding a small delay inside the function solves this problem.
2019-06-27 18:24:20 -07:00
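The timestamp collision described above can be shown with a toy model. This is not FoundationDB's trackLatest implementation; TraceSample, latestIsUnambiguous and recordAfterDelay are hypothetical names that only show why two samples with identical timestamps make "the latest sample" ambiguous, and why advancing the clock before each sample removes the tie.

```cpp
#include <cassert>
#include <vector>

// Toy trace sample: a timestamp plus the value being tracked.
struct TraceSample {
    double time;
    int serverTeamCount;
};

// Returns true when exactly one sample carries the greatest timestamp.
bool latestIsUnambiguous(const std::vector<TraceSample>& samples) {
    assert(!samples.empty());
    double latest = samples[0].time;
    for (const TraceSample& s : samples) {
        if (s.time > latest) latest = s.time;
    }
    int withLatestTime = 0;
    for (const TraceSample& s : samples) {
        if (s.time == latest) ++withLatestTime;
    }
    return withLatestTime == 1;
}

// Emulates "adding a small delay": the clock advances before each sample is
// recorded, so consecutive samples get strictly increasing timestamps.
void recordAfterDelay(std::vector<TraceSample>& samples,
                      double& now,
                      double delay,
                      int serverTeamCount) {
    assert(delay > 0);
    now += delay;
    samples.push_back({now, serverTeamCount});
}
```

Two samples recorded at the same simulated time tie for "latest"; two samples recorded through recordAfterDelay do not, and the second one is unambiguously the latest.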
Meng Xu 4fe3c7f749 TeamCollectionInfo:Revert to original version where it is 2019-06-27 17:09:21 -07:00
Meng Xu 42620e4831 TeamCollectionTest:GetTeamCollectionValid wait until values are correct 2019-06-27 16:52:36 -07:00
Meng Xu 8d5e848808 QuietDatabase test: Check each server has at least 1 team 2019-06-27 14:22:41 -07:00
Meng Xu 53324e4db7 TeamCollectionInfo: clang format 2019-06-27 11:27:29 -07:00
Meng Xu cc6a0e9bcd TeamCollectionTest:Do not enforce minServerTeamOnServer larger than 0
In ConfigureTest, one server may be left with 0 server teams, even if
we call buildTeams in the storageServerTracker.
2019-06-27 11:27:29 -07:00