826 Commits

Author SHA1 Message Date
Evan Tschannen 1022e0a5c6 added yields to the log router and tlogs after processing a version 2018-09-04 17:16:44 -07:00
Evan Tschannen 21f5cf9ce9 suppress spammy trace events 2018-09-04 17:12:26 -07:00
Evan Tschannen 65eabedb6c fix: addSubsetOfEmergencyTeams could add unhealthy teams
optimized teamTracker to check if it satisfies the policy more efficiently
added yields to initialization to avoid slow tasks when adding lots of teams
2018-08-31 17:54:55 -07:00
Evan Tschannen 72c86e909e fix: tracking of the number of unhealthy servers was incorrect
fix: locality equality was only checking zoneId
2018-08-31 17:40:27 -07:00
Evan Tschannen 90bf277206 require key value store memory to recover cleanly when recovering the txnStateStore, since all of the data it is recovering has been fsync’ed 2018-08-31 13:07:48 -07:00
Evan Tschannen 1e2ce75ce4 fix: if usable_regions=1 extraTlogEligibleMachines was calculated incorrectly 2018-08-31 13:04:00 -07:00
Evan Tschannen b0d94597d4 Add additional metrics to track fetch key duration on the storage servers 2018-08-31 13:01:36 -07:00
Evan Tschannen d8659a5822 fix: bytesWritten would overflow and go negative 2018-08-31 12:46:57 -07:00
Evan Tschannen 6496a6d9c8 fix: start move keys will only move destination servers to become source servers if less than destination servers are healthy and the total number of sources is less than 2x the number of destinations 2018-08-31 12:43:14 -07:00
Evan Tschannen e60c668853 The cluster controller will increase its failure monitoring delay after there have been many unfinishedRecoveries 2018-08-31 10:51:55 -07:00
Evan Tschannen 84e1f7b2b5 added overhead bytes durable to complement overhead bytes input 2018-08-21 22:35:04 -07:00
Evan Tschannen 74f7412975 added separate logging for overhead bytes 2018-08-21 22:18:38 -07:00
Evan Tschannen ffde1a0e28 renamed onlySystem to mustContainSystemMutations, to accurately represent what setting the key does 2018-08-21 22:15:45 -07:00
Evan Tschannen d7c01f0419 added a separate knob for tlog’s recoverMemoryLimit 2018-08-21 21:11:23 -07:00
Evan Tschannen cb60002944 Added the ability to disable all commits which do not modify the system keys by setting \xff/onlySystem = 1 in the database 2018-08-21 21:09:50 -07:00
Evan Tschannen a694364a39 fix: teams larger than the storageTeamSize can never become healthy, so we do not need to track them in our data structures. After configuring from usable_regions=2 to usable_regions=1 we will have a lot of these types of teams, leading to performance issues 2018-08-21 21:08:15 -07:00
Evan Tschannen e770629229 fix: json_spirit::write_string is very CPU intensive, especially for large JSON documents. The cluster controller would call this function for each status reply it needed to send, resulting in a slow task. 2018-08-15 19:39:06 -07:00
Evan Tschannen 883050d12f moved the creation of the yieldPromiseStream to properly yield moves from initialDataDistribution 2018-08-13 22:29:55 -07:00
Evan Tschannen f52d841e8a we need to send notifications when the leader fitness becomes worse so that we repopulate availableCandidates to compare with the new lower fitness 2018-08-13 20:56:02 -07:00
Evan Tschannen 2341e5d8ad fix: we must yield when updating shardsAffectedByTeamFailure with the initial shards. A test with 1 million shards caused a 22 second slow task 2018-08-13 19:46:47 -07:00
Evan Tschannen 8fc8aa0493 fix: we must notify every time nextNominee is not present to continue to repopulate availableCandidates 2018-08-13 17:59:47 -07:00
Evan Tschannen aaa90de7d9 merge 5.2 into 6.0 2018-08-13 10:13:03 -07:00
Evan Tschannen 4f9dd10644 fix: as long as some leader was sending heartbeats we would keep the currentNominee as leader, even if that currentNominee was not the one sending the heartbeats 2018-08-10 17:11:24 -07:00
Evan Tschannen 9c918a28f6 fix: status was reporting no replicas remaining when the remote datacenter was initially configured with usable_regions=2 2018-08-09 13:16:09 -07:00
Evan Tschannen 7c5d414f7b fix: during destruction logData could attempt to dereference tLogData after it has been deleted 2018-08-09 12:38:35 -07:00
Evan Tschannen 6f02ea843a prevented a slow task when too many shards were sent to the data distribution queue after switching to a fearless deployment 2018-08-09 12:37:46 -07:00
Evan Tschannen 7f7755165c slowly send notifications to clients to clear the list of dead clients 2018-08-08 17:29:32 -07:00
Evan Tschannen 0ca11aabe6 Merge branch 'release-6.0' of github.com:apple/foundationdb into release-6.0 2018-08-07 17:23:52 -07:00
Evan Tschannen 3bb8dad431 TooManyNotifications is only SevWarnAlways if it happens more than once a day. Status continuously adds to notifications currently, so we expect this to trigger every 4-5 days. 2018-08-07 17:00:43 -07:00
A.J. Beamon 9b1f7408d5
Merge pull request #678 from ajbeamon/use-new-data-lag-fields
Fix: use new data lag fields when making storage server message indicating high lag.
2018-08-07 15:42:23 -07:00
A.J. Beamon 7d831ef9c3 Revert change that prints lag with 2 decimal points of precision. 2018-08-07 15:41:51 -07:00
A.J. Beamon e0cf525951 Fix: use new data lag fields when making storage server message indicating high lag. 2018-08-07 11:02:09 -07:00
Evan Tschannen 6f328d41ac suppressed spammy trace events 2018-08-06 12:12:55 -07:00
Evan Tschannen c757c68bfa fix: nextVersion needs to be set to logData->version if version_sizes is empty 2018-08-04 23:53:37 -07:00
Evan Tschannen 9d0a07a400 fix: trackLatest for master recovery state was wrong, causing status to report incorrect recovery states 2018-08-04 12:50:56 -07:00
Evan Tschannen fec285146c significant cpu optimization in update storage 2018-08-04 12:36:48 -07:00
Evan Tschannen be1a4d74c7 tlogs serve reads to log routers at a low priority, to prevent them from using all their resources catching up a remote dc that has been down for a long time
increase the amount of memory ratekeeper budgets for tlogs so that there is a gap after the spill threshold to prevent temporarily overshooting the budget
2018-08-04 10:31:30 -07:00
Evan Tschannen 71f89f372f changed a trace event name to avoid scope type mismatch on the tag field 2018-08-03 15:53:38 -07:00
Evan Tschannen 2619234477 Merge branch 'release-5.2' into release-6.0
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
2018-08-03 11:40:24 -07:00
Evan Tschannen 501033c5af fix: tlog spilling on a stopped log was only making one version durable at a time 2018-08-03 11:38:12 -07:00
Evan Tschannen 1c29275672 call all methods which could disable a trace event before it is initialized. In practice this means calling .error first, then .suppressFor, then all your details. 2018-08-01 14:30:57 -07:00
Evan Tschannen 57f121481c reverted killing processes because of io_error, we should fix the problem in a better way in the future 2018-07-16 15:09:07 -07:00
Evan Tschannen f72a9f60c0 only disable fearless if a datacenter has actually been killed
fix: we must prevent recovery into the dead datacenter while reducing usable_regions
2018-07-16 10:06:57 -07:00
Evan Tschannen 30b2f85020 fix: it is not safe to drop logs supporting the current primary datacenter, because configuring usable_regions down will drop the storage servers in the remote region, leaving you with no remaining logs 2018-07-14 16:26:45 -07:00
Evan Tschannen 0f59dc4086 fix: do not write to the persistent queue when we are terminated, which could happen if shutdown was caused by setting a promise in the asyncPullData loop 2018-07-13 17:01:31 -07:00
Evan Tschannen 10ae883a68 changed the location of a yield 2018-07-12 17:59:12 -07:00
Evan Tschannen 4fedd05506 added more yields to avoid slow tasks 2018-07-12 17:47:35 -07:00
Evan Tschannen d47aae27f3 added a yield to getMore() 2018-07-12 16:27:27 -07:00
Evan Tschannen 392c73affb fixed a few slow tasks 2018-07-12 14:06:59 -07:00
Evan Tschannen d12dac60ec fix: the same team was being added multiple times to primaryTeams 2018-07-12 12:10:18 -07:00