Evan Tschannen
1022e0a5c6
added yields to the log router and tlogs after processing a version
2018-09-04 17:16:44 -07:00
Evan Tschannen
21f5cf9ce9
suppress spammy trace events
2018-09-04 17:12:26 -07:00
Evan Tschannen
65eabedb6c
fix: addSubsetOfEmergencyTeams could add unhealthy teams
...
optimized teamTracker to check if it satisfies the policy more efficiently
added yields to initialization to avoid slow tasks when adding lots of teams
2018-08-31 17:54:55 -07:00
Evan Tschannen
72c86e909e
fix: tracking of the number of unhealthy servers was incorrect
...
fix: locality equality was only checking zoneId
2018-08-31 17:40:27 -07:00
Evan Tschannen
90bf277206
require key value store memory to recover cleanly when recovering the txnStateStore, since all of the data it is recovering has been fsync’ed
2018-08-31 13:07:48 -07:00
Evan Tschannen
1e2ce75ce4
fix: if usable_regions=1 extraTlogEligibleMachines was calculated incorrectly
2018-08-31 13:04:00 -07:00
Evan Tschannen
b0d94597d4
Add additional metrics to track fetch key duration on the storage servers
2018-08-31 13:01:36 -07:00
Evan Tschannen
d8659a5822
fix: bytesWritten would overflow and go negative
2018-08-31 12:46:57 -07:00
Evan Tschannen
6496a6d9c8
fix: start move keys will only move destination servers to become source servers if less than destination servers are healthy and the total number of sources is less than 2x the number of destinations
2018-08-31 12:43:14 -07:00
Evan Tschannen
e60c668853
The cluster controller will increase its failure monitoring delay after there have been many unfinishedRecoveries
2018-08-31 10:51:55 -07:00
Evan Tschannen
84e1f7b2b5
added overhead bytes durable to complement overhead bytes input
2018-08-21 22:35:04 -07:00
Evan Tschannen
74f7412975
added separate logging for overhead bytes
2018-08-21 22:18:38 -07:00
Evan Tschannen
ffde1a0e28
renamed onlySystem to mustContainSystemMutations, to accurately represent what setting the key does
2018-08-21 22:15:45 -07:00
Evan Tschannen
d7c01f0419
added a separate knob for tlog’s recoverMemoryLimit
2018-08-21 21:11:23 -07:00
Evan Tschannen
cb60002944
Added the ability to disable all commits which do not modify the system keys by setting \xff/onlySystem = 1 in the database
2018-08-21 21:09:50 -07:00
Evan Tschannen
a694364a39
fix: teams larger than the storageTeamSize can never become healthy, so we do not need to track them in our data structures. After configuring from usable_regions=2 to usable_regions=1 we will have a lot of these types of teams, leading to performance issues
2018-08-21 21:08:15 -07:00
Evan Tschannen
e770629229
fix: json_spirit::write_string is very CPU intensive, especially for large JSON documents. The cluster controller would call this function for each status reply it needed to send, resulting in a slow task.
2018-08-15 19:39:06 -07:00
Evan Tschannen
883050d12f
moved the creation of the yieldPromiseStream to properly yield moves from initialDataDistribution
2018-08-13 22:29:55 -07:00
Evan Tschannen
f52d841e8a
we need to send notifications when the leader fitness becomes worse so that we repopulate availableCandidates to compare with the new lower fitness
2018-08-13 20:56:02 -07:00
Evan Tschannen
2341e5d8ad
fix: we must yield when updating shardsAffectedByTeamFailure with the initial shards. A test with 1 million shards caused a 22 second slow task
2018-08-13 19:46:47 -07:00
Evan Tschannen
8fc8aa0493
fix: we must notify every time nextNominee is not present to continue to repopulate availableCandidates
2018-08-13 17:59:47 -07:00
Evan Tschannen
aaa90de7d9
merge 5.2 into 6.0
2018-08-13 10:13:03 -07:00
Evan Tschannen
4f9dd10644
fix: as long as some leader was sending heartbeats we would keep the currentNominee as leader, even if that currentNominee was not the one sending the heartbeats
2018-08-10 17:11:24 -07:00
Evan Tschannen
9c918a28f6
fix: status was reporting no replicas remaining when the remote datacenter was initially configured with usable_regions=2
2018-08-09 13:16:09 -07:00
Evan Tschannen
7c5d414f7b
fix: during destruction logData could attempt to dereference tLogData after it has been deleted
2018-08-09 12:38:35 -07:00
Evan Tschannen
6f02ea843a
prevented a slow task when too many shards were sent to the data distribution queue after switching to a fearless deployment
2018-08-09 12:37:46 -07:00
Evan Tschannen
7f7755165c
slowly send notifications to clients to clear the list of dead clients
2018-08-08 17:29:32 -07:00
Evan Tschannen
0ca11aabe6
Merge branch 'release-6.0' of github.com:apple/foundationdb into release-6.0
2018-08-07 17:23:52 -07:00
Evan Tschannen
3bb8dad431
TooManyNotifications is only SevWarnAlways if it happens more than once a day. Status continuously adds to notifications currently, so we expect this to trigger every 4-5 days.
2018-08-07 17:00:43 -07:00
A.J. Beamon
9b1f7408d5
Merge pull request #678 from ajbeamon/use-new-data-lag-fields
...
Fix: use new data lag fields when making storage server message indicating high lag.
2018-08-07 15:42:23 -07:00
A.J. Beamon
7d831ef9c3
Revert change that prints lag with 2 decimal points of precision.
2018-08-07 15:41:51 -07:00
A.J. Beamon
e0cf525951
Fix: use new data lag fields when making storage server message indicating high lag.
2018-08-07 11:02:09 -07:00
Evan Tschannen
6f328d41ac
suppressed spammy trace events
2018-08-06 12:12:55 -07:00
Evan Tschannen
c757c68bfa
fix: nextVersion needs to be set to logData->version if version_sizes is empty
2018-08-04 23:53:37 -07:00
Evan Tschannen
9d0a07a400
fix: trackLatest for master recovery state was wrong, causing status to report incorrect recovery states
2018-08-04 12:50:56 -07:00
Evan Tschannen
fec285146c
significant cpu optimization in update storage
2018-08-04 12:36:48 -07:00
Evan Tschannen
be1a4d74c7
tlogs serve reads to log routers at a low priority, to prevent them from using all their resources catching up a remote dc that has been down for a long time
...
increase the amount of memory ratekeeper budgets for tlogs so that there is a gap after the spill threshold to prevent temporarily overshooting the budget
2018-08-04 10:31:30 -07:00
Evan Tschannen
71f89f372f
changed a trace event name to avoid scope type mismatch on the tag field
2018-08-03 15:53:38 -07:00
Evan Tschannen
2619234477
Merge branch 'release-5.2' into release-6.0
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
2018-08-03 11:40:24 -07:00
Evan Tschannen
501033c5af
fix: tlog spilling on a stopped log was only making one version durable at a time
2018-08-03 11:38:12 -07:00
Evan Tschannen
1c29275672
call all methods which could disable a trace event before it is initialized. In practice this means calling .error first, then .suppressFor, then all your details.
2018-08-01 14:30:57 -07:00
Evan Tschannen
57f121481c
reverted killing processes because of io_error, we should fix the problem in a better way in the future
2018-07-16 15:09:07 -07:00
Evan Tschannen
f72a9f60c0
only disable fearless if a datacenter has actually been killed
...
fix: we must prevent recovery into the dead datacenter while reducing usable_regions
2018-07-16 10:06:57 -07:00
Evan Tschannen
30b2f85020
fix: it is not safe to drop logs supporting the current primary datacenter, because configuring usable_regions down will drop the storage servers in the remote region, leaving you with no remaining logs
2018-07-14 16:26:45 -07:00
Evan Tschannen
0f59dc4086
fix: do not write to the persistent queue when we are terminated, which could happen if shutdown was caused by setting a promise in the asyncPullData loop
2018-07-13 17:01:31 -07:00
Evan Tschannen
10ae883a68
changed the location of a yield
2018-07-12 17:59:12 -07:00
Evan Tschannen
4fedd05506
added more yields to avoid slow tasks
2018-07-12 17:47:35 -07:00
Evan Tschannen
d47aae27f3
added a yield to getMore()
2018-07-12 16:27:27 -07:00
Evan Tschannen
392c73affb
fixed a few slow tasks
2018-07-12 14:06:59 -07:00
Evan Tschannen
d12dac60ec
fix: the same team was being added multiple times to primaryTeams
2018-07-12 12:10:18 -07:00