384 Commits

Author SHA1 Message Date
Evan Tschannen c05c95cbe8 forgot to rename the knob 2020-02-25 15:47:39 -08:00
Evan Tschannen aa4d1357b3 handle the case that there is only one healthy team 2020-02-21 15:41:01 -08:00
Evan Tschannen 457dbc5215
Update fdbserver/DataDistribution.actor.cpp
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2020-02-21 15:39:17 -08:00
Evan Tschannen 6a634652c4
Update fdbserver/DataDistribution.actor.cpp
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2020-02-21 15:39:06 -08:00
Evan Tschannen 08914a2acd Once available space ratio falls below 0.3 avoid moving data to teams with less free space than the median team 2020-02-21 15:14:32 -08:00
Evan Tschannen 819c55556c More aggressively attempt to find teams that do not have low disk space 2020-02-20 16:47:50 -08:00
A.J. Beamon e1fb568fd1 Merge branch 'release-6.2' into dd-use-available-space
# Conflicts:
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/DataDistribution.actor.h
#	fdbserver/DataDistributionQueue.actor.cpp
2020-02-20 16:12:42 -08:00
A.J. Beamon e4b483796d Combine some logic that was doing similar computations for free space ratio. 2020-02-20 14:52:08 -08:00
A.J. Beamon 4c9c736253 Data distribution uses available space instead of free space when evaluating whether processes are low on space and penalizing them. 2020-02-20 11:21:03 -08:00
A.J. Beamon 3a1ba5a077 Rename variable for clarity 2020-02-20 10:59:52 -08:00
A.J. Beamon c164acb88d Add new criteria to DD's GetTeamRequest that allow you to require shards be present on the team and that the team have a minimum free ratio. This avoids scenarios where the team chosen when processing the request is later rejected by the requestor, causing rebalancing movements to get stuck. 2020-02-20 09:32:00 -08:00
A.J. Beamon b8a252da40 Clarify the names of a couple trace fields 2020-02-10 08:15:00 -08:00
Evan Tschannen 9b80498180 Added a trace event to warn if a shard is merged before enough time has elapsed from becoming low bandwidth 2020-01-10 14:58:38 -08:00
Evan Tschannen c2608f0af9 fix: completeSources could be larger than the teamSize, so we need to check all completeSources
we do not need to track bestSize, since all teams in the list will be the same size
2020-01-10 14:46:40 -08:00
Evan Tschannen ab7071932f Data distribution no longer attempts to pick teams which share members of the source unless the team matches exactly 2020-01-09 16:59:37 -08:00
Evan Tschannen 3a3ab5664b fix: team trackers for bad teams that contain a removed server must be cancelled or the cluster will falsely report those teams as failed 2019-11-22 10:20:13 -08:00
Evan Tschannen f8e44d2f71 fix: If a storage server was offline, it would not be checked for being in an undesired dc 2019-10-23 23:04:39 -07:00
Evan Tschannen 86bcb84b45 Raised the data distribution priority of splitting shards above restoring fault tolerance to avoid hot write shards 2019-10-11 17:50:43 -07:00
Evan Tschannen 4b5080fbea added a few more missing data distribution priorities 2019-09-27 19:39:53 -07:00
Evan Tschannen 324d0bd3b0 Merge branch 'release-6.2' of github.com:apple/foundationdb into feature-cleanup-mutations 2019-09-27 19:15:14 -07:00
Evan Tschannen 3bb62e008c lowered the priority of some delays in data distribution so that the process will prefer other work 2019-09-27 18:33:13 -07:00
Meng Xu 32ebd08f9f DD:Trigger storage recruitment when an invalid address locality is corrected 2019-09-24 13:35:38 -07:00
Meng Xu 515689d07b
Update fdbserver/DataDistribution.actor.cpp
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2019-09-18 14:45:18 -07:00
Meng Xu d175b61a62 DD:Trace when invalid locality is corrected
Change getWorkers param from txn handler to db
2019-09-18 13:52:54 -07:00
Meng Xu 93bbc26a35 set:erase:Use return value of erase iterator as next iterator 2019-09-17 15:28:30 -07:00
Meng Xu d2fd1f4931 DD:MisconfiguredLocality:Fix review comments 2019-09-17 13:04:21 -07:00
Meng Xu 37d2318eed DD:Handle worker with incorrect locality
When a worker has incorrect locality, the worker will be excluded from
storage recruitment.
When the worker has its locality corrected by system operators,
the worker will be reincluded for storage recruitment.
2019-09-14 12:12:56 -07:00
Meng Xu c3960aba17 DD:initializeStorage:Exclude worker with invalid locality 2019-09-13 22:05:41 -07:00
Meng Xu 75460089e1 DD_VALIDATE_LOCALITY:Add comment for our future selves
When we add a simulation test that misconfigures a cluster by not setting some
locality entries, we should set DD_VALIDATE_LOCALITY always true.
Otherwise, simulation tests may fail.
2019-09-13 16:26:54 -07:00
Meng Xu 78b8e48cef DD:ValidLocality:Resolve review comment 2019-09-13 15:35:16 -07:00
Meng Xu e1dcdbf3d2 LocalityData:Remove verbose check for valid locality 2019-09-13 15:11:13 -07:00
Meng Xu 8970d9858b DD:isValidLocality:A generic way to check any replicationPolicy 2019-09-13 14:55:51 -07:00
Meng Xu 1196841b3d DD:IsValidLocality:Clang format 2019-09-13 13:56:43 -07:00
Meng Xu e8878b16d4 DD:Valid locality includes an empty but set locality entry 2019-09-13 13:55:46 -07:00
Meng Xu 1596e2e4a5 DD:TCMachine:Use processID as machineID if zoneID is unset 2019-09-13 13:43:41 -07:00
Meng Xu 3ad7e3adb3 DD:DD_VALIDATE_LOCALITY:Guard the checking of locality validity 2019-09-13 13:19:35 -07:00
Meng Xu 90d6a27a0d DD:IsValidLocality:Consider configured replica policy 2019-09-13 12:04:49 -07:00
Meng Xu 52f6297b52 DD:Introduce isValidLocality
A server or machine has a valid locality only if it sets correct
locality entries.

Build teams should only use the valid locality servers or machines
2019-09-13 11:30:26 -07:00
Evan Tschannen cc41f3e2fc fix: an unhealthy server with a low number of teams could cause data distribution to build every possible team 2019-09-12 14:18:10 -07:00
sramamoorthy 5d87443323 improved error msgs for snapshot cmd 2019-08-27 16:43:52 -07:00
Evan Tschannen 297b65236f added additional trace events to warn when different parts of shard relocations take more than 10 minutes 2019-08-16 14:56:58 -07:00
Evan Tschannen ba54508c47 code cleanup 2019-08-06 16:30:30 -07:00
Evan Tschannen 5dc4c80d44 fix: the machineAttrition workload did not ensure that healthyZone was always cleared
fix: an assert could trigger spuriously
2019-08-05 15:00:17 -07:00
Evan Tschannen 7d7aa27c2d
Merge pull request #1814 from dongxinEric/feature/1508/finer-grained-dd-controls
Added finer grained controls to DataDistribution in fdbcli.
2019-07-31 17:36:20 -07:00
Evan Tschannen bba01c6531 fix: add subsetOfEmergencyTeam could add an unsorted team 2019-07-31 16:02:08 -07:00
Xin Dong b653ddb30d Final clean ups after rebasing master 2019-07-30 22:35:34 -07:00
Xin Dong 5d20364423 Address review comments 2019-07-30 22:24:30 -07:00
Xin Dong 1922c39377 Resolve review comments. 100K run shows one suspicious ASSERT_WE_THINK failure which I think could be a race. 2019-07-30 22:24:30 -07:00
Xin Dong c6e5472d8d Apply suggestions from code review
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2019-07-30 22:20:45 -07:00
Xin Dong f5d6e3a5b3 - Addressed review comments
- Added test for the storage server failure disable switch
2019-07-30 22:20:45 -07:00