Commit Graph

946 Commits

Author SHA1 Message Date
Evan Tschannen 0217aed74c Merge branch 'release-6.0'
# Conflicts:
#	bindings/go/README.md
#	documentation/sphinx/source/release-notes.rst
#	fdbserver/MasterProxyServer.actor.cpp
#	versions.target
2018-10-15 18:38:51 -07:00
Evan Tschannen 0acfae1e76 fixed the windows linker error 2018-10-15 18:19:51 -07:00
Evan Tschannen a8feecbfad added a comment to explain code ordering 2018-10-12 16:27:13 -07:00
Evan Tschannen 8ed4ce183c Merge branch 'release-6.0' of github.com:apple/foundationdb into release-6.0 2018-10-12 14:56:19 -07:00
Evan Tschannen 17a1e3ce35 fix: the master proxy would log an OpCommit for empty commits to the txnStateStore 2018-10-12 12:58:17 -07:00
A.J. Beamon 419231d798 Fix: status was trying to read a metric under the wrong name, leading to an error that caused the cluster to report itself unhealthy and some metrics to be missing. 2018-10-10 13:33:28 -07:00
Evan Tschannen 4c95a5ee0f added the basic structure for parallel restore 2018-10-09 18:47:28 -07:00
Evan Tschannen ecddeab2ae fixed review comments; demote killRegionCycle test for now 2018-10-08 10:39:39 -07:00
Evan Tschannen 1314bcec9e Merge branch 'release-6.0'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
2018-10-05 12:54:00 -07:00
Evan Tschannen 47e31133aa fix: only create a new version if the version has not been created before 2018-10-05 12:37:29 -07:00
Evan Tschannen 06be70bace fix: if localEnd is smaller than begin, we cannot peek from the local dc 2018-10-05 12:36:34 -07:00
Evan Tschannen daed31708b fix: we can only repair dead DCs if we have a fearless configuration 2018-10-05 12:35:37 -07:00
Evan Tschannen 3922e477a5 Merge branch 'release-6.0'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/ManagementAPI.actor.cpp
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/LogSystemDiskQueueAdapter.actor.cpp
#	fdbserver/SimulatedCluster.actor.cpp
#	fdbserver/TLogServer.actor.cpp
2018-10-03 16:57:18 -07:00
Evan Tschannen 9de55f362b
Merge pull request #793 from ajbeamon/add-new-storage-status-metrics
Add new metrics for bytes queried, keys queried, mutation bytes, muta…
2018-10-03 16:34:26 -07:00
Evan Tschannen 598788f60b
Merge pull request #801 from etschannen/feature-fix-forced-recovery
Fixed a number of problems with forced recoveries
2018-10-03 16:32:03 -07:00
Evan Tschannen 636420abee fix: if the disk queue adapter peek hangs for a while, switch to a peek from a different locality 2018-10-03 13:58:55 -07:00
Evan Tschannen 28545e0f8d multi cursors start a get more for the first 10 cursors to hide latency 2018-10-03 13:57:45 -07:00
Evan Tschannen aa51d69b2d fix: set peekLocality for upgraded tags 2018-10-03 13:54:59 -07:00
Evan Tschannen c9f4109539 fix: add some additional time in the kill region workload to detect if we recovered successfully 2018-10-02 17:47:15 -07:00
Evan Tschannen cdaf5e1192 fix: forced recovery does not recover tags from any DC besides the surviving one 2018-10-02 17:46:22 -07:00
Evan Tschannen 69711a107b fix: because of forced recovery, 0 log router tags does not mean we are a special tlog set 2018-10-02 17:45:11 -07:00
Evan Tschannen e7e1c634e0 fix: we need to restart the peek cursor when the known committed version becomes available 2018-10-02 17:44:14 -07:00
Evan Tschannen a92fc911ac do not spin on a failed storage server recruitment 2018-10-02 17:31:07 -07:00
Evan Tschannen 15ce215c1b fix: parallel peek requests leaked memory 2018-10-02 17:28:39 -07:00
A.J. Beamon 84c2e3567f Fix keys queried to use the RowsQueried metric instead of BytesQueried. 2018-10-01 11:19:28 -07:00
A.J. Beamon a98fcf5972 Rename durable_lag to durability_lag 2018-10-01 09:58:49 -07:00
Evan Tschannen bd6b743a81 fix: the storage server must always keep MAX_READ_TRANSACTION_LIFE_VERSIONS of history in memory, because forced recovery could roll back an epoch end.
fix: rollbacks were triggered unnecessarily
2018-09-28 16:04:59 -07:00
Evan Tschannen 3fdf72c626 fix: we need to force recovery if the master is still attempting to read the txs tag 2018-09-28 13:33:33 -07:00
Evan Tschannen 59335aa757 fix: the latest generation of remote transaction logs might have less data than a previous generation, because they take over at known committed version. Detect this case and end at the version that has the most data 2018-09-28 12:25:27 -07:00
Evan Tschannen c577840020 fix: forced recovery should remove all references to the old primary tlogs in all generations of logs to help the peek logic avoid attempting to read from them 2018-09-28 12:23:09 -07:00
Evan Tschannen 05e7f08b26 added a peek method which will attempt to read the txsTag from the local region as much as possible 2018-09-28 12:21:08 -07:00
Evan Tschannen a24eadd73a fix: for remote logs, their known committed version cannot be set to 1, because they can be used when their durable version is 0, leading to a known committed version being greater than a queue committed version 2018-09-28 12:17:21 -07:00
Evan Tschannen e64c55dce0 fix: data distribution would sometimes use the wrong priority when fixing an incomplete movement; this led to the cluster thinking the data was replicated in all regions before it actually was 2018-09-28 12:15:23 -07:00
Evan Tschannen b1fe069165 fix: during forced recovery logs can be removed from the logSystemConfig. We need to avoid killing the removed logs as unneeded until we actually complete the recovery 2018-09-28 12:13:46 -07:00
Evan Tschannen 22e6afbb18 fix: the cluster controller did not pass in its own locality when creating its database object, therefore it was not using locality aware load balancing 2018-09-28 12:12:06 -07:00
Evan Tschannen b560b94ebc fix: do not force a recovery if the master was already in the other region (and therefore already recovered)
fix: reboot the remaining DC, because any storage server rejoins that were rolled back will cause that server to be unusable
2018-09-28 12:10:04 -07:00
A.J. Beamon f196e2d4dc Log metrics about read requests as well as completed reads. 2018-09-27 15:32:39 -07:00
A.J. Beamon 118e21c446 Add new metrics for bytes queried, keys queried, mutation bytes, mutations, and durable lag to the storage role in status. 2018-09-27 14:33:21 -07:00
Steve Atherton 6756188f53
Merge pull request #760 from ajbeamon/fix-actor-warnings
Fix warnings about ACTORs not having waits. Fix shadowing of future v…
2018-09-24 10:07:59 -07:00
A.J. Beamon 48e620c680 Change the first of two trace events named "BTreeIntegrityCheck" to have the name "BTreeIntegrityCheckResults" 2018-09-24 08:40:18 -07:00
A.J. Beamon 92990d6aef Merge release-6.0 into master 2018-09-21 16:14:39 -07:00
Evan Tschannen 77e2fb787e Merge branch 'release-6.0' into feature-fix-forced-recovery 2018-09-21 14:55:37 -07:00
Evan Tschannen 3f86905ea7 fix: restore did not take into account that the end version of a log file does not exist in that file. As a result, restores done at the same version at which a snapshot completes did not apply the mutations at that final version. 2018-09-21 11:48:28 -07:00
Evan Tschannen 6b6d7a087d The cluster controller should never consider itself as failed (that will be handled by the coordinators)
Simplified the check that the cluster controller is excluded
2018-09-20 17:01:11 -07:00
Evan Tschannen 31d0b0315f fix: tlog spill policy would spill everything when it wanted to spill nothing
use a flow lock to protect updatePersistData and initPersistentState from committing simultaneously
2018-09-20 15:33:38 -07:00
Evan Tschannen 03728db99b do not trigger better master exists if the cluster controller is excluded, since the master will change anyways once the cluster controller is moved 2018-09-19 18:28:24 -07:00
Evan Tschannen 861c8aa675 consider server health when building subsets of emergency teams 2018-09-19 17:57:01 -07:00
Evan Tschannen 702d018882 fix: we cannot use count on an async map, because someone waiting onChange for an item will cause it to exist in the map before it is set 2018-09-19 16:11:57 -07:00
Evan Tschannen 6d18193b3a fix: team->setHealthy was not being called correctly on initially unhealthy teams 2018-09-19 14:48:07 -07:00
Evan Tschannen 270b1b24a6 fix: we have to use durableKnownCommittedVersion, because it is the true lower bound on the recovery version of the remote logs
fixed a compiler error
2018-09-18 16:29:03 -07:00