
687 Commits

Author SHA1 Message Date
Evan Tschannen 2987f85177 fix: known committed version must be updated before creating the tlogQueueEntryRef 2018-06-26 23:21:30 -07:00
Evan Tschannen 00167b0157 renamed some uses of knownCommittedVersion to durableKnownCommittedVersion
epochEnd exclusively refers to the last version a set of logs is responsible for serving peek requests for
recoverAt and recoveredAt refer to the last committed version of the previous generation
2018-06-26 18:20:28 -07:00
Evan Tschannen 6e19622872 fix: durableKnownCommittedVersion was incorrect because we could update knownCommittedVersion before pushing a TLogQueueEntry with that known committed version 2018-06-26 18:02:55 -07:00
Evan Tschannen c6313a79e3 fix: the cluster controller needs to continue to retry recruitment until after wait_for_good_remote_recruitment_delay 2018-06-25 18:20:16 -07:00
Evan Tschannen 1a8dac365d fix: poppedAllAfter was not set to a large enough version 2018-06-25 15:57:11 -07:00
Evan Tschannen 2ec8744ab3 fix: parallel get more needs to verify the begin version matches the end of the previous request, because when a peek cursor expires we lose all history, so the same sequence number could start at different versions 2018-06-25 11:15:49 -07:00
Evan Tschannen 398497f5c3 fix: wrong desired count used when checking good remote fitness 2018-06-22 12:24:01 -07:00
Evan Tschannen 0123627d67 Merge branch 'master' into feature-remote-logs
# Conflicts:
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2018-06-22 10:43:07 -07:00
Evan Tschannen 96b0a91ab2 simplified betterCount logic 2018-06-22 10:38:36 -07:00
Evan Tschannen 5fc8199abc Swapped OkayFit and UnsetFit, because generally if machine classes are set on one machine they are set everywhere and it helps with wait_for_good_recruitment logic
wait_for_good_recruitment now requires that you have the desired count of each role
remote recruitment is given a much longer wait_for_good_recruitment time interval, which does not start until enough remote machines have registered
2018-06-22 10:15:24 -07:00
Evan Tschannen 8a8914f046 re-added the ability to configure the number of log routers. Many log routers are needed to get a sufficient number of sockets involved in copying data across the WAN 2018-06-22 00:04:00 -07:00
Evan Tschannen 1dce97f28c Merge branch 'release-5.2'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbserver/SimulatedCluster.actor.cpp
#	packaging/msi/FDBInstaller.wxs
#	versions.target
2018-06-21 17:05:11 -07:00
Evan Tschannen 3943dc1dfc
Merge pull request #521 from ajbeamon/metrics-improvements
Metrics improvements
2018-06-21 16:42:40 -07:00
Evan Tschannen 9a91dad5bd fixed compile issue 2018-06-21 16:34:36 -07:00
Evan Tschannen 678b4494f4 added logging for the datacenter version difference 2018-06-21 16:31:52 -07:00
A.J. Beamon 36f84c9cff Fix uninitialized variable 2018-06-21 16:03:05 -07:00
A.J. Beamon e8f66df001 Add metrics for watches and mutations on the storage server. The storage server tracks its lag with the logs, and status tries to report a more accurate measure of this lag. 2018-06-21 15:59:43 -07:00
Evan Tschannen 8bd7eaebdb fix: broken_promise from push can be thrown into the proxy’s actor collection 2018-06-21 15:55:27 -07:00
Evan Tschannen 68ac3bdc4c log routers now calculate a precise version to pop for their log router tag 2018-06-21 15:29:46 -07:00
Evan Tschannen b6ad8b1092 merge master 2018-06-21 12:06:38 -07:00
Alec Grieser e26bae6ec0
Merge pull request #509 from ajbeamon/report-unused-allocated-memory-in-status
Track unused allocated memory in ProcessMetrics and report it in status.
2018-06-20 21:08:39 -07:00
Evan Tschannen 4352db674f the storage server would not always know about all options when running fetch keys 2018-06-20 17:05:03 -07:00
Evan Tschannen f755961c42 use parallelGetMore during log recovery 2018-06-20 17:04:06 -07:00
Richard Low 361e335730 Disable cert validation in simulation 2018-06-20 16:26:11 -07:00
Alex Miller 569fd05954
Merge pull request #506 from etschannen/feature-remote-logs
Performance improvements when using multiple DCs
2018-06-20 12:12:05 -07:00
A.J. Beamon 5e81f4ac7e Track unused allocated memory in ProcessMetrics and report it in status. 2018-06-20 10:10:51 -07:00
Evan Tschannen e951df95b7 when doing data movement where one region has the data and the other doesn’t, first move a single replica to the other region to save WAN bandwidth 2018-06-19 23:15:30 -07:00
Evan Tschannen e7999e7a3e log routers need to use parallelGetMore when peeking because the latency to the primary datacenter makes the bandwidth of normal peeking too low. 2018-06-19 22:16:45 -07:00
Evan Tschannen 56b9fb09f0
Merge pull request #503 from etschannen/feature-remote-logs
fixed a bug with the tlog memory limit
2018-06-18 22:14:26 -07:00
Evan Tschannen df4c445e25 fix: we need to return from commit when stopped 2018-06-18 22:12:46 -07:00
Evan Tschannen 403fb5a2e9 removed unnecessary suppressFor 2018-06-18 17:59:29 -07:00
Alex Miller abf0e68364
Merge pull request #502 from etschannen/feature-remote-logs
fixed a few performance issues with multiple DC deployments
2018-06-18 17:46:37 -07:00
Evan Tschannen eaca0fb2ea fixed incorrect priorities on the log router 2018-06-18 17:36:40 -07:00
Evan Tschannen 0bdd25df23 ratekeeper does not base its control on remote storage servers 2018-06-18 17:23:55 -07:00
Evan Tschannen b79feaddd3 added a hard memory limit to the TLog to prevent it from running out of memory. Because remote logs are not ratekeeper-controlled, this is their only protection 2018-06-18 17:22:40 -07:00
Alex Miller aead2586f4 Move .actor.g.cpp files to .obj.
This means that grep over our source tree doesn't return 2x the results.
2018-06-18 16:47:27 -07:00
Evan Tschannen 127c2ad775 fix: prevent adding the same location multiple times for satellite logs 2018-06-18 15:27:28 -07:00
Evan Tschannen 50e1e03130 fix: for configurations with anti-quorums to work, the push actors need to be put in the proxy’s actor collection 2018-06-18 15:25:54 -07:00
Evan Tschannen 1ccfb3a0f4 fix: log_anti_quorum was always 0 in simulation
removed durableStorageQuorum, because it is no longer a useful configuration parameter
2018-06-18 10:24:57 -07:00
Evan Tschannen e8c462882b re-added remote_logs as a parameter, because it could be useful to have a different number of logs between when recruited as primary and remote 2018-06-18 10:22:34 -07:00
Evan Tschannen 0913368651 added usable_regions to specify if we will replicate into a remote region
remote replication defaults to the primary replication
removed remote_logs, because they should be specified as an override in the regions object
2018-06-17 19:31:15 -07:00
Evan Tschannen 7aef5ec6f1 got rid of persistUnrecoveredBefore
added persistLocality
2018-06-17 14:44:33 -07:00
Evan Tschannen f637c680f1 fix: populateSatelliteTagLocations was broken
fix: satellites do not index the upgraded locality
2018-06-17 13:29:17 -07:00
Evan Tschannen 6931a00993 satellite log push locations are static per tag, which will reduce the number of tags each satellite log has to index, and reduce the proxy cpu when calculating push locations 2018-06-16 17:39:02 -07:00
Evan Tschannen f694f7c9ca removed hasBestPolicy 2018-06-15 12:36:19 -07:00
Evan Tschannen 0d87186821 use a specific locality for satellites 2018-06-15 11:06:38 -07:00
Evan Tschannen 09c92c887b fix: extraTlogEligibleMachines was not calculated correctly in all cases 2018-06-15 10:23:33 -07:00
Evan Tschannen 246abd1207 added full_replication to status 2018-06-14 21:14:18 -07:00
Evan Tschannen 284233baa1 added a key in the database with the locality of the current master 2018-06-14 19:36:02 -07:00
Evan Tschannen 0103b6f5ed added datacenter_version_difference to status 2018-06-14 19:09:25 -07:00