Commit Graph

485 Commits

Author SHA1 Message Date
Evan Tschannen aed2d34bcb Merge branch 'master' into feature-proxy-load-balance
# Conflicts:
#	fdbclient/NativeAPI.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	flow/Knobs.cpp
2020-05-01 09:19:39 -07:00
Evan Tschannen b7f5f3be48 merge in master 2020-04-28 13:11:47 -07:00
Evan Tschannen 0c84ad4bc6 Merge pull request #2917 from bnamasivayam/fail-slow-ss
Mark the storage servers that are continually lagging as unhealthy
2020-04-22 23:18:35 -07:00
Balachandar Namasivayam d5bef6fc32 Update fdbserver/DataDistribution.actor.cpp 2020-04-22 09:45:56 -07:00
Evan Tschannen ba3e2af473 Merge commit '5288033bcfe40c3ade97c8bf2d04cf31b3f16cb1' into feature-tree-broadcast 2020-04-17 15:17:37 -07:00
Evan Tschannen 33efb9ec97 code cleanup based on review comments 2020-04-17 15:05:01 -07:00
Alex Miller 1439de37b5 Convert GetRangeLimits() -> TOO_MANY + ASSERT(). 2020-04-12 18:23:14 -07:00
Evan Tschannen ce4493f679 many bug fixes 2020-04-10 13:45:16 -07:00
Balachandar Namasivayam 6916434f7d Addressed review comments 2020-04-08 10:48:32 -07:00
Balachandar Namasivayam 69ef8a127b Add a backstop mechanism to stop failing too many storage servers when they fall behind. 2020-04-06 23:37:11 -07:00
Alex Miller 6078fd1b18 Convert UID to Tag in keyServers to reduce txnStateStore size 2020-04-05 14:30:09 -07:00
Balachandar Namasivayam 73272fc72e Version difference is now the diff between TLog versions and SS version. 2020-04-03 19:04:43 -07:00
Balachandar Namasivayam a70bfcc3c8 Remove unnecessary comment. 2020-03-31 18:33:12 -07:00
Balachandar Namasivayam b1c3893d40 Fix some corner case bugs exposed by simulation.
In one case, when a SS joins the cluster and DD doesn't find any healthy server to form a team with the newly added server, then the SS does not get added to any team even when the other servers get healthy.
Another is an extreme case where a data center is down and a SS in the active DC joins and then dies immediately, but not before DD adds it to a destination team for a relocating shard. DD then waits indefinitely for the dead data center to come back up before the cluster can be fully recovered.
2020-03-31 18:33:12 -07:00
Balachandar Namasivayam ad1dd4fd9b Mark the storage servers that are continually lagging as unhealthy, giving the Data Distributor the chance to move data off these servers. 2020-03-31 18:25:39 -07:00
Evan Tschannen e08f0201f1 merge release 6.2 into master 2020-03-17 12:51:47 -07:00
Evan Tschannen 7adc916e18 Merge pull request #2806 from ajbeamon/improve-team-request-performance
Improve performance of get team requests.
2020-03-16 11:56:45 -07:00
A.J. Beamon 700b13e5f8 Remember the best team from team requests, which will likely be the best again and can save us some computation. 2020-03-13 15:21:33 -07:00
Evan Tschannen 12f2b32770 added additional logging in data distribution 2020-03-13 15:19:33 -07:00
Evan Tschannen 9e99a00c8f fix: do not use priority 0 left when calculating priorities for empty teams 2020-03-13 13:56:46 -07:00
A.J. Beamon 555db50cd1 Avoid calling into SABTF so frequently. Use a cheaper call that only checks that shards exist. 2020-03-12 11:22:03 -07:00
Evan Tschannen 303df197cf Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	bindings/c/test/mako/mako.c
#	documentation/sphinx/source/release-notes.rst
#	fdbbackup/backup.actor.cpp
#	fdbclient/NativeAPI.actor.cpp
#	fdbclient/NativeAPI.actor.h
#	fdbserver/DataDistributionQueue.actor.cpp
#	fdbserver/Knobs.cpp
#	fdbserver/Knobs.h
#	fdbserver/LogRouter.actor.cpp
#	fdbserver/SkipList.cpp
#	fdbserver/fdbserver.actor.cpp
#	flow/CMakeLists.txt
#	flow/Knobs.cpp
#	flow/Knobs.h
#	flow/flow.vcxproj
#	flow/flow.vcxproj.filters
#	versions.target
2020-03-06 18:22:46 -08:00
Evan Tschannen 1128666840 added additional logging on the log router 2020-03-05 18:17:06 -08:00
Evan Tschannen e219c1671f Merge branch 'release-6.2' into feature-dd-region-queue
# Conflicts:
#	fdbserver/Knobs.h
2020-03-04 16:25:38 -08:00
Evan Tschannen 125bd13198 fix: in multi-region configurations, the data distribution queue could start too much work, expecting that the remote region would contribute to the read workload 2020-03-04 14:17:17 -08:00
Evan Tschannen 6296465e07 Make the DD priority associated with populating a remote region lower than machine failures 2020-03-04 14:07:32 -08:00
Evan Tschannen 924d335aa7 Merge branch 'release-6.2'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	flow/Knobs.cpp
#	flow/Knobs.h
2020-02-25 18:25:19 -08:00
Evan Tschannen c05c95cbe8 forgot to rename the knob 2020-02-25 15:47:39 -08:00
Evan Tschannen 96258b9809 Merge branch 'release-6.2'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbcli/fdbcli.actor.cpp
#	fdbclient/ManagementAPI.actor.cpp
#	fdbrpc/FlowTransport.actor.cpp
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/DataDistribution.actor.h
#	fdbserver/DataDistributionQueue.actor.cpp
#	fdbserver/KeyValueStoreMemory.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/QuietDatabase.actor.cpp
#	fdbserver/SkipList.cpp
#	fdbserver/StorageMetrics.actor.h
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/fdbserver.actor.cpp
#	fdbserver/storageserver.actor.cpp
#	fdbserver/workloads/KVStoreTest.actor.cpp
#	flow/CMakeLists.txt
#	flow/Knobs.cpp
#	flow/Knobs.h
#	flow/genericactors.actor.cpp
#	flow/serialize.h
2020-02-21 19:09:16 -08:00
Evan Tschannen aa4d1357b3 handle the case that there is only one healthy team 2020-02-21 15:41:01 -08:00
Evan Tschannen 457dbc5215 Update fdbserver/DataDistribution.actor.cpp
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2020-02-21 15:39:17 -08:00
Evan Tschannen 6a634652c4 Update fdbserver/DataDistribution.actor.cpp
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2020-02-21 15:39:06 -08:00
Evan Tschannen 08914a2acd Once available space ratio falls below 0.3 avoid moving data to teams with less free space than the median team 2020-02-21 15:14:32 -08:00
Evan Tschannen 819c55556c More aggressively attempt to find teams that do not have low disk space 2020-02-20 16:47:50 -08:00
A.J. Beamon e1fb568fd1 Merge branch 'release-6.2' into dd-use-available-space
# Conflicts:
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/DataDistribution.actor.h
#	fdbserver/DataDistributionQueue.actor.cpp
2020-02-20 16:12:42 -08:00
A.J. Beamon e4b483796d Combine some logic that was doing similar computations for free space ratio. 2020-02-20 14:52:08 -08:00
A.J. Beamon 4c9c736253 Data distribution uses available space instead of free space when evaluating whether processes are low on space and penalizing them. 2020-02-20 11:21:03 -08:00
A.J. Beamon 3a1ba5a077 Rename variable for clarity 2020-02-20 10:59:52 -08:00
A.J. Beamon c164acb88d Add new criteria to DD's GetTeamRequest that allow you to require shards be present on the team and that the team have a minimum free ratio. This avoids scenarios where the team chosen when processing the request is later rejected by the requestor, causing rebalancing movements to get stuck. 2020-02-20 09:32:00 -08:00
mpilman 5a9d420cb7 Merge remote-tracking branch 'upstream/release-6.2' into release-merges/20200210 2020-02-10 10:02:05 -08:00
A.J. Beamon b8a252da40 Clarify the names of a couple trace fields 2020-02-10 08:15:00 -08:00
Evan Tschannen 3f9d9d8b84 Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	cmake/FlowCommands.cmake
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/StorageServerInterface.h
#	fdbserver/DataDistributionTracker.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/fdbserver.actor.cpp
#	flow/Knobs.h
#	flow/Platform.cpp
#	versions.target
2020-01-16 18:37:47 -08:00
Evan Tschannen 9b80498180 Added a trace event to warn if a shard is merged before enough time has elapsed from becoming low bandwidth 2020-01-10 14:58:38 -08:00
Evan Tschannen c2608f0af9 fix: completeSources could be larger than the teamSize, so we need to check all completeSources
we do not need to track bestSize, since all teams in the list will be the same size
2020-01-10 14:46:40 -08:00
Evan Tschannen ab7071932f Data distribution no longer attempts to pick teams which share members of the source unless the team matches exactly 2020-01-09 16:59:37 -08:00
Evan Tschannen 83ad9caf54 implemented a load balancing algorithm which evens out the number of requests processed by each proxy 2020-01-08 01:59:01 -08:00
Evan Tschannen 59738e8ef1 fixed compiler error 2019-11-22 16:19:34 -08:00
Evan Tschannen 3c769fcf60 Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	documentation/sphinx/source/release-notes.rst
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	versions.target
2019-11-22 15:39:19 -08:00
Evan Tschannen 3a3ab5664b fix: team trackers for bad teams that contain a removed server must be cancelled or the cluster will falsely report those teams as failed 2019-11-22 10:20:13 -08:00
Andrew Noyes d4de608bb6 Fix OPEN_FOR_IDE build 2019-10-25 10:42:22 -07:00