Evan Tschannen
c05c95cbe8
forgot to rename the knob
2020-02-25 15:47:39 -08:00
Evan Tschannen
aa4d1357b3
handle the case that there is only one healthy team
2020-02-21 15:41:01 -08:00
Evan Tschannen
457dbc5215
Update fdbserver/DataDistribution.actor.cpp
...
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2020-02-21 15:39:17 -08:00
Evan Tschannen
6a634652c4
Update fdbserver/DataDistribution.actor.cpp
...
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2020-02-21 15:39:06 -08:00
Evan Tschannen
08914a2acd
Once available space ratio falls below 0.3 avoid moving data to teams with less free space than the median team
2020-02-21 15:14:32 -08:00
Evan Tschannen
819c55556c
More aggressively attempt to find teams that do not have low disk space
2020-02-20 16:47:50 -08:00
A.J. Beamon
e1fb568fd1
Merge branch 'release-6.2' into dd-use-available-space
...
# Conflicts:
# fdbserver/DataDistribution.actor.cpp
# fdbserver/DataDistribution.actor.h
# fdbserver/DataDistributionQueue.actor.cpp
2020-02-20 16:12:42 -08:00
A.J. Beamon
e4b483796d
Combine some logic that was doing similar computations for free space ratio.
2020-02-20 14:52:08 -08:00
A.J. Beamon
4c9c736253
Data distribution uses available space instead of free space when evaluating whether processes are low on space and penalizing them.
2020-02-20 11:21:03 -08:00
A.J. Beamon
3a1ba5a077
Rename variable for clarity
2020-02-20 10:59:52 -08:00
A.J. Beamon
c164acb88d
Add new criteria to DD's GetTeamRequest that allow you to require shards be present on the team and that the team have a minimum free ratio. This avoids scenarios where the team chosen when processing the request is later rejected by the requestor, causing rebalancing movements to get stuck.
2020-02-20 09:32:00 -08:00
A.J. Beamon
b8a252da40
Clarify the names of a couple trace fields
2020-02-10 08:15:00 -08:00
Evan Tschannen
9b80498180
Added a trace event to warn if a shard is merged before enough time has elapsed since it became low bandwidth
2020-01-10 14:58:38 -08:00
Evan Tschannen
c2608f0af9
fix: completeSources could be larger than the teamSize, so we need to check all completeSources
...
we do not need to track bestSize, since all teams in the list will be the same size
2020-01-10 14:46:40 -08:00
Evan Tschannen
ab7071932f
Data distribution no longer attempts to pick teams which share members of the source unless the team matches exactly
2020-01-09 16:59:37 -08:00
Evan Tschannen
3a3ab5664b
fix: team trackers for bad teams that contain a removed server must be cancelled or the cluster will falsely report those teams as failed
2019-11-22 10:20:13 -08:00
Evan Tschannen
f8e44d2f71
fix: If a storage server was offline, it would not be checked for being in an undesired dc
2019-10-23 23:04:39 -07:00
Evan Tschannen
86bcb84b45
Raised the data distribution priority of splitting shards above restoring fault tolerance to avoid hot write shards
2019-10-11 17:50:43 -07:00
Evan Tschannen
4b5080fbea
added a few more missing data distribution priorities
2019-09-27 19:39:53 -07:00
Evan Tschannen
324d0bd3b0
Merge branch 'release-6.2' of github.com:apple/foundationdb into feature-cleanup-mutations
2019-09-27 19:15:14 -07:00
Evan Tschannen
3bb62e008c
lowered the priority of some delays in data distribution so that the process will prefer other work
2019-09-27 18:33:13 -07:00
Meng Xu
32ebd08f9f
DD:Trigger storage recruitment when an invalid address locality is corrected
2019-09-24 13:35:38 -07:00
Meng Xu
515689d07b
Update fdbserver/DataDistribution.actor.cpp
...
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2019-09-18 14:45:18 -07:00
Meng Xu
d175b61a62
DD:Trace when invalid locality is corrected
...
Change getWorkers param from txn handler to db
2019-09-18 13:52:54 -07:00
Meng Xu
93bbc26a35
set:erase:Use return value of erase iterator as next iterator
2019-09-17 15:28:30 -07:00
Meng Xu
d2fd1f4931
DD:MisconfiguredLocality:Fix review comments
2019-09-17 13:04:21 -07:00
Meng Xu
37d2318eed
DD:Handle worker with incorrect locality
...
When a worker has incorrect locality, the worker will be excluded from
storage recruitment.
When the worker has its locality corrected by system operators,
the worker will be reincluded for storage recruitment.
2019-09-14 12:12:56 -07:00
Meng Xu
c3960aba17
DD:initializeStorage:Exclude worker with invalid locality
2019-09-13 22:05:41 -07:00
Meng Xu
75460089e1
DD_VALIDATE_LOCALITY:Add comment for our future selves
...
When we add simulation tests that misconfigure a cluster by not setting some
locality entries, we should always set DD_VALIDATE_LOCALITY to true.
Otherwise, simulation tests may fail.
2019-09-13 16:26:54 -07:00
Meng Xu
78b8e48cef
DD:ValidLocality:Resolve review comment
2019-09-13 15:35:16 -07:00
Meng Xu
e1dcdbf3d2
LocalityData:Remove verbose check for valid locality
2019-09-13 15:11:13 -07:00
Meng Xu
8970d9858b
DD:isValidLocality:A generic way to check any replicationPolicy
2019-09-13 14:55:51 -07:00
Meng Xu
1196841b3d
DD:IsValidLocality:Clang format
2019-09-13 13:56:43 -07:00
Meng Xu
e8878b16d4
DD:Valid locality includes an empty but set locality entry
2019-09-13 13:55:46 -07:00
Meng Xu
1596e2e4a5
DD:TCMachine:Use processID as machineID if zoneID is unset
2019-09-13 13:43:41 -07:00
Meng Xu
3ad7e3adb3
DD:DD_VALIDATE_LOCALITY:Guard the checking of locality validity
2019-09-13 13:19:35 -07:00
Meng Xu
90d6a27a0d
DD:IsValidLocality:Consider configured replica policy
2019-09-13 12:04:49 -07:00
Meng Xu
52f6297b52
DD:Introduce isValidLocality
...
A server or machine has a valid locality only if it sets correct
locality entries.
Build teams should only use the valid locality servers or machines
2019-09-13 11:30:26 -07:00
Evan Tschannen
cc41f3e2fc
fix: an unhealthy server with a low number of teams could cause data distribution to build every possible team
2019-09-12 14:18:10 -07:00
sramamoorthy
5d87443323
improved error msgs for snapshot cmd
2019-08-27 16:43:52 -07:00
Evan Tschannen
297b65236f
added additional trace events to warn when different parts of shard relocations take more than 10 minutes
2019-08-16 14:56:58 -07:00
Evan Tschannen
ba54508c47
code cleanup
2019-08-06 16:30:30 -07:00
Evan Tschannen
5dc4c80d44
fix: the machineAttrition workload did not ensure that healthyZone was always cleared
...
fix: an assert could trigger spuriously
2019-08-05 15:00:17 -07:00
Evan Tschannen
7d7aa27c2d
Merge pull request #1814 from dongxinEric/feature/1508/finer-grained-dd-controls
...
Added finer grained controls to DataDistribution in fdbcli.
2019-07-31 17:36:20 -07:00
Evan Tschannen
bba01c6531
fix: add subsetOfEmergencyTeam could add an unsorted team
2019-07-31 16:02:08 -07:00
Xin Dong
b653ddb30d
Final clean ups after rebasing master
2019-07-30 22:35:34 -07:00
Xin Dong
5d20364423
Address review comments
2019-07-30 22:24:30 -07:00
Xin Dong
1922c39377
Resolve review comments. 100K run shows one suspicious ASSERT_WE_THINK failure which I think could be a race.
2019-07-30 22:24:30 -07:00
Xin Dong
c6e5472d8d
Apply suggestions from code review
...
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2019-07-30 22:20:45 -07:00
Xin Dong
f5d6e3a5b3
- Addressed review comments
...
- Added test for the storage server failure disable switch
2019-07-30 22:20:45 -07:00