826 Commits

Author SHA1 Message Date
Evan Tschannen 8dfda1e57b fixed another trace event 2018-06-11 12:53:07 -07:00
Evan Tschannen e28769b98e fixed trace event name 2018-06-11 12:43:08 -07:00
Evan Tschannen 372ed67497 Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/TagPartitionedLogSystem.actor.cpp
2018-06-11 11:34:10 -07:00
Evan Tschannen 588eaf4b36 fix: previous delay 0 could still cause us to recruit a tlog before processing disk errors 2018-06-11 11:26:30 -07:00
Evan Tschannen 64e0260085 fix: assert did not properly handle default constructed policies 2018-06-10 21:51:59 -07:00
Evan Tschannen b60264024a fix: we need to copy the txsTag on satellite logs 2018-06-10 20:30:44 -07:00
Evan Tschannen a5c2a8ee8a fix: allow disk errors to cancel the actor before recruiting logs 2018-06-10 20:27:19 -07:00
Evan Tschannen 134b5d6f65 fix: only consider data distribution started when remote has recovered so quiet database works correctly 2018-06-10 20:25:15 -07:00
Evan Tschannen 2407e3774b fix: we cannot run with less storage replication than log replication because it breaks recruitment logic 2018-06-10 20:22:58 -07:00
Evan Tschannen 4903df5ce9 fix: give time to detect failed servers before building teams 2018-06-10 20:21:39 -07:00
Evan Tschannen 0bc7274d0e fix: hasSatelliteReplication was set incorrectly 2018-06-10 20:20:41 -07:00
Evan Tschannen 6e48d93d39 backed out the healthy team check because it was unnecessary 2018-06-10 12:43:32 -07:00
Evan Tschannen 8a24bf6124 describe did not list all the log sets 2018-06-10 12:38:50 -07:00
A.J. Beamon f965954122 Merge commit '82be52205b95464e355c449fdf3e7d483fa06677' into trace-log-refactor
# Conflicts:
#	fdbserver/Status.actor.cpp
#	fdbserver/workloads/DDMetrics.actor.cpp
#	flow/Trace.cpp
2018-06-08 16:22:22 -07:00
Evan Tschannen b9826dc1cb fix: do not automatically reduce redundancy when we move keys if the database does not have remote replicas. This is to prevent problems when dropping remote replicas from a configuration. 2018-06-08 16:17:27 -07:00
Balachandar Namasivayam 8360f71cbb Merge branch 'master' of github.com:apple/foundationdb into save-fitness-info
# Conflicts:
#	fdbserver/worker.actor.cpp
2018-06-08 16:09:59 -07:00
Balachandar Namasivayam 32285ee958 Don't crash if fitness file is corrupted in real production use case. 2018-06-08 14:03:36 -07:00
A.J. Beamon 99c9958db7 Some more trace event normalization 2018-06-08 13:57:00 -07:00
A.J. Beamon 0ca51989bb Merge branch 'master' into trace-log-refactor
# Conflicts:
#	fdbserver/QuietDatabase.actor.cpp
#	fdbserver/Status.actor.cpp
#	flow/Trace.cpp
2018-06-08 13:24:30 -07:00
Evan Tschannen 50779a1860
Merge pull request #448 from bnamasivayam/fix-trprofile-test-bug
Having fixed limits for getRange results in continuously getting tran…
2018-06-08 12:52:50 -07:00
Balachandar Namasivayam 34995d4d64 Address review comments. 2018-06-08 11:51:51 -07:00
Balachandar Namasivayam 20febf5ef9 Address review comments. 2018-06-08 11:24:51 -07:00
A.J. Beamon e5488419cc Attempt to normalize trace events:
* Detail names now all start with an uppercase character and contain no underscores. Ideally these should be head-first camel case, though that was harder to check.
* Type names have the same rules, except they allow one underscore (to support a usage pattern Context_Type). The first character after the underscore is also uppercase.
* Use seconds instead of milliseconds in details.

Added a check when events are logged in simulation that logs a message to stderr if the first two rules above aren't followed.

This probably doesn't address every instance of the above problems, but all of the events I was able to hit in simulation pass the check.
2018-06-08 11:11:08 -07:00
A.J. Beamon f1d389448c
Merge pull request #453 from apple/release-5.2
Merge release-5.2 into master
2018-06-08 10:41:44 -07:00
A.J. Beamon 6461478695
Merge pull request #452 from apple/release-5.1
Merge release-5.1 into release-5.2
2018-06-08 10:41:13 -07:00
Evan Tschannen 953c27e570
Merge pull request #431 from ajbeamon/tlog-rename-variables
Rename several variables in TLogServer.actor.cpp to follow our normal camel case conventions.
2018-06-08 10:30:22 -07:00
A.J. Beamon c9543791fd Fix case of newSeverity detail in StderrSeverity trace event 2018-06-08 10:24:12 -07:00
Evan Tschannen 7d392689fe fix: only update metrics for healthy destinations, because unhealthy destinations are already in the source 2018-06-07 18:12:04 -07:00
Evan Tschannen e4d5817679 fix: we must serve getTeam requests before readyToStart is set because we cannot complete relocateShard requests without getTeam responses from both team collections 2018-06-07 16:14:40 -07:00
Balachandar Namasivayam 514b0e3c20 Having fixed limits for getRange results in continuously getting transaction_too_old error in some scenarios.
Cutting the limits by half in such cases allows the test to progress.
2018-06-07 15:27:05 -07:00
Evan Tschannen 9f0c16f062 do not build teams which contain failed servers 2018-06-07 14:05:53 -07:00
Balachandar Namasivayam 11b79c6c94 Save fitness info of a process to become a cluster controller. This info is currently lost after a reboot. Save this info and reload it to avoid unnecessary re-recruitments. 2018-06-07 13:07:19 -07:00
Evan Tschannen b423d73b42 fix: do not finish a shard relocation until all of the storage servers have made the current recovery version durable. This is to prevent dropping a needed storage server as a source for a shard after dropping a remote configuration 2018-06-07 12:29:25 -07:00
Evan Tschannen f26a2f771d fix: log router popped one too many versions from messageBlocks 2018-06-05 13:42:48 -07:00
Evan Tschannen be06938d9d fix: dropping the remote replication will cause all remote storage servers to die. Make sure we are not restoring redundancy before doing this to prevent data loss in simulation. 2018-06-04 18:46:09 -07:00
Evan Tschannen 6cf9508aae finished a comment 2018-06-03 19:38:51 -07:00
Evan Tschannen e95f663ebc fix: the log router could pop too much data from the logs in rare situations 2018-06-03 19:34:24 -07:00
Evan Tschannen bf65e745a9 tlogs do not index tags for other localities 2018-06-01 22:51:08 -07:00
Evan Tschannen c519339adb avoid peeking from logs that do not match the tag’s locality 2018-06-01 18:42:48 -07:00
Evan Tschannen ce6a2f0563
Merge pull request #425 from bnamasivayam/leader-election-optimize
Optimize client and server connection times to cluster controller, es…
2018-06-01 18:35:27 -07:00
Balachandar Namasivayam 59bfa74197 Address review comments. Refactor getLeader function to mask the first 7 bits of changeID and return the masked LeaderInfo. 2018-06-01 18:23:24 -07:00
Balachandar Namasivayam 529d0497f1 Proxy going OOM when applying high volumes of writes to a proxy, particularly in a sudden fashion before ratekeeper can control the workload.
Address this issue by proactively monitoring the memory used by commit batches and dropping requests if a certain memory limit is exceeded.
2018-06-01 15:21:40 -07:00
A.J. Beamon 1f0b519a73 Rename several variables in TLogServer.actor.cpp to follow our normal camel case conventions. I didn't rename every variable here, because some appear to be data structures (like a map) following the pattern keydesc_valuedesc, and I wasn't sure that the straightforward keydescValuedesc rename made sense. I did rename a couple of instances of these where it seemed reasonable, though. 2018-06-01 10:18:07 -07:00
Balachandar Namasivayam 9f55ccd4a5 Remove extraneous comments. 2018-05-31 15:32:47 -07:00
A.J. Beamon 78839b20fd Merge branch 'master' into trace-log-refactor
# Conflicts:
#	flow/Trace.cpp
2018-05-31 10:46:20 -07:00
Balachandar Namasivayam 070366ca70 Optimize client and server connection times to cluster controller, especially in multi DC configurations.
A majority (quorum) answer from co-ordinators was required to connect to the cluster controller.
Now a cluster controller is optimistically selected to connect even if there is no quorum.
2018-05-30 16:48:04 -07:00
A.J. Beamon d9c702a9e3 Merge release-5.1 into release-5.2 2018-05-30 09:09:55 -07:00
Evan Tschannen 0e699a3c23 fix: ratekeeper should only control on local logs 2018-05-29 10:51:23 -07:00
A.J. Beamon 026458baf3 Merge release-5.2 into master 2018-05-23 15:32:56 -07:00
A.J. Beamon e538fb4065 Add error description to error output when networking could not be initialized. 2018-05-23 15:05:28 -07:00
Alec Grieser 40babc40e1
remove one unnecessary line ; fix else formatting 2018-05-15 17:20:44 -07:00
Alec Grieser 6d132717f2
add versionstamp compatibility test to VersionStampWorkload
surfaces error found in #387
2018-05-15 17:09:24 -07:00
Dennis Schafroth a9f54e1865 Compile on macOS 10.13.4: Use ASSERT_ABORT in destructors. Import fstream 2018-05-15 12:55:02 -07:00
A.J. Beamon 02df30149f Merge branch 'release-5.2' into trace-log-refactor 2018-05-11 11:22:34 -07:00
Evan Tschannen 91338fc984 Merge branch 'master' into feature-remote-logs 2018-05-10 15:33:45 -07:00
Evan Tschannen 8f984cb2c9 Merge branch 'release-5.2'
# Conflicts:
#	fdbrpc/TLSConnection.h
2018-05-10 09:13:22 -07:00
Evan Tschannen d3450ce5b0
Merge pull request #343 from bnamasivayam/tls-plugin
Tls plugin
2018-05-09 16:35:53 -07:00
Evan Tschannen f6e55d0b74
Merge pull request #348 from etschannen/release-5.2
DR upgrade tests now test the durability of the data.
2018-05-09 15:40:03 -07:00
Evan Tschannen 8930c2e3db DR upgrade tests now test the durability of the data. 2018-05-09 15:11:05 -07:00
Balachandar Namasivayam 7591931a09 Revert "Make tls_verify_peers as a comma separated string of constraints."
This reverts commit 2033847e4b.
2018-05-09 14:40:36 -07:00
Balachandar Namasivayam 2033847e4b Make tls_verify_peers as a comma separated string of constraints. 2018-05-09 14:37:39 -07:00
Alec Grieser f3093642b3
Merge pull request #242 from alecgrieser/32437306-better-versionstamped-value
Unify SET_VERSIONSTAMPED_KEY and SET_VERSIONSTAMPED_VALUE API
2018-05-09 09:04:07 -07:00
Balachandar Namasivayam e8b7f4b190 Add password support for tls. 2018-05-08 20:46:31 -07:00
Balachandar Namasivayam 49af5d685b Restore previous behavior of not specifying peer_verify option means disable checking. 2018-05-08 18:54:44 -07:00
Balachandar Namasivayam d3b5cfb93c Support latest TLS plugin.
Add support for https in backup.
2018-05-08 16:28:13 -07:00
A.J. Beamon 54b4c9e061 Merge branch 'release-5.2' into trace-log-refactor
# Conflicts:
#	fdbserver/Status.actor.cpp
2018-05-08 15:51:54 -07:00
Evan Tschannen 9f0d244efe Merge branch 'master' into feature-remote-logs 2018-05-08 13:28:23 -07:00
Evan Tschannen 7acdc314e4 Merge branch 'release-5.2'
# Conflicts:
#	fdbrpc/TLSConnection.actor.cpp
2018-05-08 13:22:53 -07:00
Evan Tschannen 1f6c6a886b Merge branch 'release-5.1' into release-5.2 2018-05-08 13:08:11 -07:00
A.J. Beamon ca720e1540
Merge pull request #297 from apple/release-5.2
Merge 5.2 to Master
2018-05-08 12:04:20 -07:00
Alec Grieser 47c9e4f923
update bindings and bindingtester that uses versionstamps to use new protocol
issue #148
2018-05-08 08:57:09 -07:00
Alec Grieser 464e2cdbf0
change SetVersionstampedKey and SetVersionstampedValue behavior based on API version to make them consistent 2018-05-08 08:57:09 -07:00
Alec Grieser 14cca75429
server components of version of alternative versionstamp op that writes to an arbitrary place in the value 2018-05-08 08:57:08 -07:00
Evan Tschannen e8f6ad88f0 fix: tripled the smallStorageTarget to prevent simulations which do a lot of work from timing out 2018-05-07 17:26:44 -07:00
Alec Grieser 752deb07a1
fix fdbmonitor help message output ; fix spelling error Ratekeeper.actor.cpp 2018-05-07 16:19:50 -07:00
Evan Tschannen 4677789b38 fix: low latency tests need 4 machines per datacenter to support triple replication after 1 machine has failed 2018-05-07 11:28:25 -07:00
Evan Tschannen 529bd34cf9 fix: when a tlog is stopped by another recruitment it no longer has the opportunity for commtingQueue to be set 2018-05-06 20:37:44 -07:00
Evan Tschannen 81c7bddaf8 fix: must check for log router errors while waiting on satellite replies because the recruitmentID will not be updated if it threw an error 2018-05-06 18:15:12 -07:00
Evan Tschannen 8cb8198250 fix: the e-brake should be buggified with ratekeeper storage limits to prevent simulation from running full blast into the e-brake resulting in simulation taking forever to complete (joshua timeouts) 2018-05-06 12:33:25 -07:00
Evan Tschannen cc6511a39e fix: we do not know that the minimum popped version on the log router is a known committed version until it has advanced. 2018-05-06 09:32:41 -07:00
Evan Tschannen b1935f1738 fix: do not allow a storage server to be removed within 5 million versions of it being added, because if a storage server is added and removed within the known committed version and recovery version, the storage server will need see either the add or remove when it peeks 2018-05-05 18:16:28 -07:00
Evan Tschannen 8371afb565 fix: log routers need to know if the log system is stopped to determine how they should peek the last log generation 2018-05-05 17:56:00 -07:00
Evan Tschannen 7ed64c821e fix: recruiting a cluster controller takes longer after restarting tests because we wait until files have recovered from disk before starting 2018-05-05 17:20:48 -07:00
Evan Tschannen e8ea02e054 fix: storage servers need to fail if they can no longer peek data 2018-05-05 17:19:59 -07:00
A.J. Beamon 432a295bc2 Add read bytes and read keys info to status. Collect this information directly from StorageMetrics rather than through ratekeeper. 2018-05-04 12:01:40 -07:00
A.J. Beamon ce0c991e78 Refactor trace events to store a vector of fields that aren't encoded until write time. Better support for pre-network trace events. Rework how trace events are queried. Some initial work towards pluggable formatting of logs. 2018-05-02 10:44:38 -07:00
Evan Tschannen 440e2ae609 fix: data distribution logic was incorrect for finding a complete source team in a failed DC 2018-05-01 23:08:31 -07:00
Evan Tschannen 87ad03ce53 locality aware load balancing was disabled on the storage servers because emergency teams might cause a server to be assigned a shard when it does not actually have the data. This problem has been fixed, so we can re-enable locality aware load balancing. 2018-05-01 22:45:22 -07:00
Evan Tschannen b4bd03e67e fix: we cannot set queueCommitEnd until we have popped the log system to prevent the popped version from going backwards 2018-05-01 22:20:25 -07:00
Evan Tschannen 12ef63b698 knobify replace contents bytes 2018-05-01 19:43:35 -07:00
Evan Tschannen 656a817e74 fix: only reconfigure during the quiet database check, because excluding at the same time as reconfiguring causes the master to indefinitely restart recovery 2018-05-01 15:31:49 -07:00
Evan Tschannen c3f2e2bb38 fix: do not attempt to become the cluster controller before recovering files from disk 2018-05-01 12:05:43 -07:00
Evan Tschannen e27531d39e Merge branch 'master' into feature-remote-logs 2018-04-30 22:55:46 -07:00
Evan Tschannen 10d25927cd Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbserver/DataDistribution.actor.cpp
2018-04-30 22:15:39 -07:00
Evan Tschannen eded5631e6 fix: epoch end was already known committed version + 1, and did not need an additional + 1. 2018-04-30 22:03:11 -07:00
Evan Tschannen e1e43cff28 endEpoch implemented using getDurableVersion 2018-04-30 18:32:04 -07:00
Alex Miller bc8e6acbe8 Fix the other half of simulation requiring a TLS Plugin.
This commit:
1. Restores --tls_plugin as a way to provide the path to the TLS plugin when running in simulation.
2. Removes the TLS Plugin as being required for 5% of tests.
3. Standardizes on 'sslEnabled' as a variable name.

And is a fix/improvement upon commit f7733d1b.

(1) previously didn't work, because we would create multiple new TLSOptions
instances and run init_plugin multiple times.  Only the first call would use
the argument specified on the command line.  To fix this, the TLSOptions
derived from the command line is threaded through all the simulation code that
needs it.

(2) was an oversight in f7733d1b, which didn't actually make "should we be TLS"
dependent on if the TLS plugin was available or not.

(3) is just nice for trying to grep around in the codebase.
2018-04-30 18:26:29 -07:00
Evan Tschannen 5143871fed passed debug ids into all versions of peek() to assist debugging 2018-04-30 13:36:35 -07:00
Evan Tschannen 883f2318a0 test fearless configurations 2018-04-30 13:17:29 -07:00
Evan Tschannen 99598d180b fix: the log router must be initialized with all expected tags to prevent mistakenly choosing a minPopped that is too high 2018-04-30 10:58:41 -07:00
Evan Tschannen 92b134eb98 fix: errors from removed were not handled properly 2018-04-29 23:05:08 -07:00
Evan Tschannen 6f318dbff2 fix: do not reply to recruitment until we are sure the log commits to the queue 2018-04-29 22:08:24 -07:00
Evan Tschannen 9cdabfed0e added useful trace events 2018-04-29 18:54:47 -07:00
Evan Tschannen 2e286b768d fix: locality is needed for a logSet to call getPushLocations
fix: accidentally deleted allowPops assignment on the log router
2018-04-29 13:47:32 -07:00
Evan Tschannen dbdeeaa5cf fix: log routers are given all the information they need to add remote tags in their initialization request 2018-04-28 18:04:57 -07:00
Alec Grieser 69e831d522
Merge remote-tracking branch 'upstream/release-5.2' into merge-release-5.2 2018-04-28 17:44:52 -07:00
Evan Tschannen 33fa8f2cac fix: make sure log routers only add remote tags from the correct log set 2018-04-28 15:04:13 -07:00
Evan Tschannen f77c1ec14e fix: fixed rare bug where a log stopped by a different recruitment would still respond successfully to the recruitment message 2018-04-28 13:34:06 -07:00
Evan Tschannen 23c0249d80 fix: old log routers tags must be available at the best location in the new generation 2018-04-28 11:13:10 -07:00
Alec Grieser a1faaafca3
Merge remote-tracking branch 'upstream/release-5.1' into merge-release-5.1 2018-04-27 16:38:18 -07:00
Yichi Chiang c721ab6854 Fix review comments 2018-04-27 13:54:34 -07:00
Evan Tschannen af63dac5dd fix: remote logs need to wait until the durable known committed version is greater than the recovery version before completing recovery to ensure we will not pick a start version that we do not have 2018-04-27 12:18:42 -07:00
Evan Tschannen 32e9ea3bb4 fix: recruited the wrong number of log routers 2018-04-26 22:22:15 -07:00
Evan Tschannen d72087bfd3 fix: we may not be able to recruit enough log routers, in this case put multiple log routers on the same worker, but also properly rank this configuration lower in better master exists 2018-04-26 22:18:07 -07:00
Evan Tschannen a12b994966 fix: log routers need tlogs to be present before accepting data 2018-04-26 18:37:51 -07:00
Evan Tschannen abcfb0604a fix: cloneNoMore needs to pass useBestSet 2018-04-26 18:32:12 -07:00
Yichi Chiang 6bddf8aefa Upgrade DR from 5.1 to 5.2 2018-04-26 17:24:40 -07:00
Evan Tschannen c7fd85243b fix: passed the wrong argument value 2018-04-26 13:25:27 -07:00
Evan Tschannen 0dd6931223 fix: remote recruitment must still wait for old log routers to be recruited since they are not needed by the newly recruited logs to finish recovery 2018-04-26 12:55:28 -07:00
Evan Tschannen 721aaa2a6b fix: we need to monitor old log routers for failures before recovery is complete
fix: after configuring out of fearless remote logs will not have all the data until the new configuration
2018-04-26 10:59:21 -07:00
Evan Tschannen a2b62e15ea fix: only peek to peekEnd() 2018-04-25 19:56:50 -07:00
Evan Tschannen 7e434348ce fix: storage servers did not properly pull data when configuring from a fearless setup to a non-fearless setup 2018-04-25 18:20:28 -07:00
Evan Tschannen fa9089c2e8 fix: removed storage servers must be popped on remote logs from the proxy 2018-04-25 15:38:34 -07:00
Evan Tschannen 471e7b9ab9 fix: update the logSystem on the proxies so that they can pop the txs tag from remote logs 2018-04-25 10:16:31 -07:00
Evan Tschannen 4119a1c5d5 do not add cursors for log sets that have no data 2018-04-24 22:06:10 -07:00
Evan Tschannen 95855dbfc4 correctly filter locality data 2018-04-24 18:14:34 -07:00
Alex Miller f7733d1bd0 Do not require the TLS Plugin for simulation.
It appears that explicit calls to TLS-related things had snuck in over time,
which meant that simulation runs that weren't even configured to use SSL still
wanted and required the TLS plugin.

This commit instead threads through the understanding of if any TLS-related
options were provided, and if not, then don't call anything TLS-related so that
we don't require the TLS plugin.

Hopefully this makes life easier for the opensource folk. :)
2018-04-24 16:53:30 -07:00
Evan Tschannen 35b2ca820a fix: certain tlog errors during remote recovery could fail to kill the master, the master could have a reference counting cycle with its actor collection 2018-04-24 16:10:14 -07:00
Evan Tschannen 1cfe1cb7f0 fix: do not let the storage server process an exhausted version, because it could prevent a rollback 2018-04-23 22:03:55 -07:00
Evan Tschannen ae1de575f1 fix: remote logs are not considered fully recovered until they are at recoveredAt 2018-04-23 17:49:46 -07:00
Evan Tschannen 3ec09ce9f6 fix: only peekSingle needs to throw worker_removed, because tlogs have other ways to get notified they are no longer needed
fix: we need to wait until tags are popped past recoveredAt instead of unrecovered before
2018-04-23 16:43:08 -07:00
Evan Tschannen 126fc53d10 fix: the start version for peek cursors that merge with multiple log sets is the maximum of the individual start versions 2018-04-23 12:42:51 -07:00
tracebundy dd36f55a90
Update fdbserver.actor.cpp
fix the bug 'fdbserver/fdbserver.actor.cpp:761:16: error: aggregate ‘std::ifstream ifs’ has incomplete type and cannot be defined'
2018-04-23 10:06:15 -07:00
Dennis Schafroth 290122637b Using ASSERT_ABORT in destructors 2018-04-23 14:05:10 +02:00
Evan Tschannen 73597f190e fix: new tlogs are initialized with exactly the tags which existed at the recovery version 2018-04-22 20:28:01 -07:00
Evan Tschannen a520d03397 fix: if we cannot find a tag, it must have been popped at the recovery version. 2018-04-22 15:08:38 -07:00
Evan Tschannen ef23136809 fix: ensure the logSystemConfig is updated with newly recruited log routers 2018-04-22 11:54:39 -07:00
Evan Tschannen fceec020de fix: use the known committed version if the last generation primary logs were in the same data center as this generation
the known committed version in end epoch is the maximum seen in all responses regardless of log set
2018-04-22 11:14:13 -07:00
Evan Tschannen c3a344d44e fix: do not choose a remote start version past the start of the locked logs 2018-04-21 16:03:28 -07:00
Evan Tschannen 28a1fa9dc2 fix: we need to notify the old log system that its recruitmentID has changed 2018-04-21 12:57:00 -07:00
Evan Tschannen 1d1e2cd367 fix: initialize the known committed version on the tlog 2018-04-21 00:41:15 -07:00
Evan Tschannen 8d350ceb5f fix: persist the known committed version on the tlogs 2018-04-20 17:55:46 -07:00
Evan Tschannen a6d9e889f0 a cleaner solution to preventing tlogs from peeking log routers 2018-04-20 13:25:22 -07:00
Evan Tschannen f5c3417905 fix: prevent tlogs from peeking the wrong log routers 2018-04-20 00:30:37 -07:00
Evan Tschannen 5da452db8e fix: pop the log routers again after the log system updates 2018-04-19 14:33:31 -07:00
Bruce Mitchener 2f8a0240f1
Fix some typos. 2018-04-19 11:44:01 -07:00
Bruce Mitchener 9cdf25eda3 Fix some typos. 2018-04-20 00:49:22 +07:00
Evan Tschannen d46d5487bd Merge branch 'release-5.2' 2018-04-18 20:46:03 -07:00
Evan Tschannen 57d650062a merge 5.1 into 5.2 2018-04-18 20:44:31 -07:00
Evan Tschannen 224621be04 fix: extraDB==0 must leave g_simulator.extraDB as null, so that non-DR tests do not attempt to use a DR database 2018-04-18 19:34:35 -07:00
Evan Tschannen 22526ef996 fix: do not tell storage servers about large sections of empty versions, because it can lead them to make mutations durable which have not been committed 2018-04-18 16:06:44 -07:00
Evan Tschannen 447c7bd15b fix: log routers use durable known committed version at the time of the pop to determine what is safe to pop from their logs
fix: storage server does not advance its version across large version increase until it has data associated with the version
2018-04-18 12:07:29 -07:00
Evan Tschannen e43fb6d8bc fix: the log routers were popping too many versions because the known committed version is less than minPopped version 2018-04-17 19:41:36 -07:00
Evan Tschannen c1ccc8522c Merge branch 'release-5.2' 2018-04-17 18:38:12 -07:00
Evan Tschannen db98c1b9b6 Merge branch 'release-5.1' into release-5.2
# Conflicts:
#	versions.target
2018-04-17 18:36:19 -07:00
Evan Tschannen 8569a85771 fix: only let a log router pop if the tlog it is serving is fully recovered 2018-04-17 15:03:22 -07:00
Evan Tschannen 760bc8bc99 fix: log router version needs to be fetched before it is available
fix: tlog did not fetch known committed version if start version was exactly equal to it
2018-04-17 11:16:48 -07:00
Evan Tschannen 093908b83f fix: log routers were starting one version too late 2018-04-17 00:29:16 -07:00
Evan Tschannen 3e40505f4a Revert "fix: remote logs should reply until they have recovered through recoverAt"
This reverts commit 3c0c03c004.
2018-04-16 23:17:16 -07:00
Evan Tschannen 3c0c03c004 fix: remote logs should reply until they have recovered through recoverAt 2018-04-16 17:25:49 -07:00
Evan Tschannen cef6c9b418 fix: the startVersion cannot be larger than the known committed version 2018-04-16 16:21:27 -07:00
Evan Tschannen dcfa1847ff fix: log router’s starting popped version must be less than its starting version 2018-04-16 11:43:03 -07:00
Evan Tschannen 3018a7b1b3 fix: the known committed version of a newly initialized log is 1, since by definition the first commit must have succeeded 2018-04-16 10:42:48 -07:00
Evan Tschannen a8662f8737 fix: remote recovered does not need to wait for old logs to be removed 2018-04-16 10:14:39 -07:00
Evan Tschannen e53f17a83a fix: the newest log router needs to start where the last old one ends 2018-04-15 14:54:22 -07:00
Evan Tschannen 5533016f1e fix: tlogs are now initialized immediately, instead of when starting the core, this must be done to pop the log routers during recovery
fix: log router start version must be the same as remote log start version
2018-04-15 14:33:07 -07:00
Evan Tschannen 0496bee1ef fix: suppress expected errors in data distribution 2018-04-15 11:30:22 -07:00
Evan Tschannen 041f5787fb fix: peekLocal does not stop when a locality does not exist
fix: lock logs only stops on special or upgraded locality
fix: recruiting old log routers respects the passed in startVersion
2018-04-14 19:06:24 -07:00
Evan Tschannen f5141acae9 fix: log routers need all logs present in their log system since they call addRemoteTags 2018-04-13 17:33:36 -07:00
Evan Tschannen 65e69620a7 fix: unrecoveredBefore on a new log is at minimum 1 2018-04-13 10:41:30 -07:00
Yichi Chiang a4e8b6492c Fix DR Upgrade workload backup range 2018-04-13 09:59:32 -07:00
Evan Tschannen c589630e53 fix: log router start version is based on the start version of the local logs 2018-04-12 18:14:23 -07:00
Evan Tschannen 3b7e4410cf fix: protect from peeking too early of a version from a log router 2018-04-12 16:15:17 -07:00
Evan Tschannen 1af5ac0d9d fix: a number of different problems prevented tlogs from using log routers during recovery 2018-04-12 15:20:54 -07:00
Evan Tschannen c6229e443c fix: do not use resolution class when using regions 2018-04-11 21:22:53 -07:00
Evan Tschannen 4248fbec61 fix: must set startVersion when upgrading 2018-04-11 17:33:17 -07:00
Evan Tschannen 19762b847d Merge branch 'release-5.2'
# Conflicts:
#	fdbserver/DatabaseConfiguration.cpp
#	fdbserver/SimulatedCluster.actor.cpp
2018-04-10 17:02:43 -07:00
Evan Tschannen c1ba16b3c8 Merge branch 'release-5.1' into release-5.2
# Conflicts:
#	bindings/java/src/test/com/apple/foundationdb/test/AbstractTester.java
#	bindings/java/src/test/com/apple/foundationdb/test/VersionstampSmokeTest.java
#	bindings/nodejs/lib/fdb.js
#	bindings/nodejs/src/Version.h
#	bindings/nodejs/tests/tuple_test.js
2018-04-10 16:50:47 -07:00
Evan Tschannen b0a88001cc
Merge pull request #132 from yichic/support-dr-upgrade-test
Support DR upgrade test
2018-04-10 16:30:19 -07:00
Evan Tschannen b46c32535c suppressed spammy trace events 2018-04-10 15:52:32 -07:00
Yichi Chiang d0230d4d13 Support DR upgrade test in 5.1 2018-04-10 15:19:53 -07:00
Alex Miller b289312a37
Merge pull request #120 from alecgrieser/storage-class-help-text
Add router to help text for storage class of fdbserver
2018-04-10 15:01:27 -07:00
Evan Tschannen 3453a51d0f remoteRecovery was still swallowing errors 2018-04-10 13:31:24 -07:00
Evan Tschannen 5fcedd2e98 fix: coordinated state errors were being eaten 2018-04-10 11:14:57 -07:00
Evan Tschannen 2ab2c788b3 fix: the start version is allowed to be larger than the recovery version 2018-04-09 21:58:14 -07:00
Evan Tschannen a738c4bec1 fix: if the known committed version is equal to the recovery version we do not need to copy any data 2018-04-09 20:48:55 -07:00
Evan Tschannen 419951f601 fix: need to initialize tlog versions to less than the startVersion 2018-04-09 17:17:11 -07:00
Evan Tschannen 27e14790b1 fix: do not start at a version larger than the recovery version 2018-04-09 15:08:01 -07:00
Evan Tschannen 7566a0d109 fix: endEpoch gets its logs from the core state, so by definition they are written 2018-04-09 11:44:54 -07:00
Evan Tschannen 4c89f721cd fix: do not include logRouter tags in lock results 2018-04-09 10:48:57 -07:00
Evan Tschannen 7af892f50b first working version of non-copying recovery working with fearless configurations 2018-04-08 21:24:05 -07:00
Alex Miller 0136a01c18 Fix "Not enough physical servers available" error due to incorrect server calculation. 2018-04-05 15:13:21 -07:00
Evan Tschannen bc938d9273 fix: storage recruitment could get stuck in a spin loop 2018-04-03 18:06:31 -07:00
Evan Tschannen 331e707684 fix: pop all tags that did not have data at the recovery version because fully popped tags may come back when pullAsyncData re-indexes the mutations 2018-03-31 16:47:56 -07:00
Evan Tschannen 96fffe2cea fix: do not update version if the log has been stopped 2018-03-30 22:11:42 -07:00
Evan Tschannen 4fb2b99341 fix: using only one region still means we need 3 machines per datacenter, the other machines in the other datacenters just won’t be used 2018-03-30 19:26:22 -07:00
Evan Tschannen 579ba58930 pop old tags only looks at recovered tags, and checks if they are still being used 2018-03-30 19:08:01 -07:00
Evan Tschannen 8352b93f48 fix: do not reuse tags that are still in historyTags, pop historyTags past epochEnd to allow tlogs to finish recovery
fix: peekLocal did not properly respect end
fix: the storage server added to the end of the history vector instead of the beginning
2018-03-30 17:39:45 -07:00
Evan Tschannen 43cb63df25 fix: the collectTags bool was set incorrectly 2018-03-29 18:19:29 -07:00
Evan Tschannen 1a4ded1c99 support upgrades by merging tags associated with the different peek requests 2018-03-29 17:54:08 -07:00
Evan Tschannen b36e08f08f first version of non-copying recovery. Upgrades are broken, and it has not been tested using fearless configurations yet 2018-03-29 15:12:38 -07:00
Evan Tschannen da737e1ea3 suppress the BestTeamStuck trace event 2018-03-26 18:32:32 -07:00
Evan Tschannen 82ed956c65 renamed the multi_dc configuration to three_datacenter. The old three_datacenter configuration was not a useful configuration. 2018-03-26 18:31:26 -07:00
Evan Tschannen b95e68eb5a fix: getDatabaseSize is really inefficient and causes slow tasks in the real world. Outside of simulation just assume the database is really large, because we only need the InvalidShardSize check in simulation 2018-03-26 17:35:11 -07:00
Alec Grieser bb5f3ebb6d add router to help text for storage class of fdbserver 2018-03-26 13:26:56 -07:00
Evan Tschannen d3fb17d30a Merge pull request #74 from bnamasivayam/client-profiling-tests
Client profiling tests - Part 1
2018-03-23 16:52:49 -07:00
Balachandar Namasivayam 1e719d79e9 Remove incorrect ASSERT's
Account for corner cases in missing chunks.
2018-03-23 15:51:56 -07:00
Evan Tschannen 5db52ab081 Merge pull request #87 from etschannen/feature-remote-logs
Feature remote logs
2018-03-23 12:55:17 -07:00
Evan Tschannen 7c48e1d31c Update SimulatedCluster.actor.cpp 2018-03-23 12:54:44 -07:00
A.J. Beamon ddc0c613ed Merge pull request #109 from apple/release-5.2
Merge Release 5.2 into master
2018-03-21 09:37:56 -07:00
Clement Pang 64deb0e0a1 Address review comments. 2018-03-20 14:38:04 -07:00
Clement Pang b46ffb4cbc Available space should take into account both memory and disk 2018-03-20 14:38:04 -07:00
Evan Tschannen 0746fe4d56 optimized tag lookups on the tlog by removing one level of vectors 2018-03-20 10:41:42 -07:00
Evan Tschannen d8e064d8bb fix: when a new log is recruited on a shared log, all outstanding commits need to be notified that they are stopped, because there is no longer a guarantee that their queueCommittedVersion will advance 2018-03-19 17:48:28 -07:00
Alec Grieser 551ea9c7f8 Merge remote-tracking branch 'upstream/release-5.2' into master-release-5.2-merge 2018-03-19 12:34:50 -07:00
yichic ede5cab192 Merge pull request #89 from yichic/share-log-mutations-5.2
Share log mutations 5.2
2018-03-19 12:01:26 -07:00
Yichi Chiang 1f2602d2b3 Fix all review comments 2018-03-19 11:33:33 -07:00
Yichi Chiang d6559b144f Share log mutations between backups and DRs which have the same backup range 2018-03-19 11:32:50 -07:00
Evan Tschannen 54be14000d do not deserialize tags 2018-03-17 11:24:18 -07:00
Evan Tschannen 4dcef08260 optimized the log router to use a vector instead of a map for tag data 2018-03-17 11:08:37 -07:00
Evan Tschannen 9c8cb445d6 optimized the tlog to use a vector for tags instead of a map 2018-03-17 10:36:19 -07:00
Evan Tschannen fecfea0f7d fix: messages vector was not cleared 2018-03-17 10:24:44 -07:00
Balachandar Namasivayam 9e3e3c8561 Add some sanity checks to deserialized data. 2018-03-16 18:45:25 -07:00
Yichi Chiang f12c1d811c Fix all review comments 2018-03-16 18:09:23 -07:00
Yichi Chiang 26b93ff920 Share log mutations between backups and DRs which have the same backup range 2018-03-16 18:09:23 -07:00
Evan Tschannen ccd70fd005 The tlog uses the tags embedded in the message instead of a separate vector of locations
optimized remote tlog committing to avoid re-serializing the message
2018-03-16 16:47:05 -07:00
Evan Tschannen 820382ea68 optimized the log router commit path to avoid re-serializing the data 2018-03-16 11:40:21 -07:00
Evan Tschannen a42205eb8e test running with only one region 2018-03-15 15:40:58 -07:00
Balachandar Namasivayam 89d7cc1093 Minor Bug fixes... 2018-03-15 11:00:47 -07:00
Evan Tschannen 82fb6424ec fix: storage recruitment could get stuck in a spin loop 2018-03-15 11:00:44 -07:00
Evan Tschannen 65b532658f added support for single region configurations 2018-03-15 10:59:30 -07:00
Alec Grieser 0853fcb052 switch to using zu for some size_t variables in printf 2018-03-14 18:07:05 -07:00
Evan Tschannen 59723f51f8 fix: continue to attempt to lock logs until remote logs are recovered, this is so that remote logs get locked and readers know they will not have any more data
do not throttle trace events in simulation
2018-03-14 12:39:55 -07:00
Balachandar Namasivayam 856d2a0a9d Add correctness tests for Client transaction profiling data format. It also includes format check across upgrades. 2018-03-14 12:39:50 -07:00
Alec Grieser 70a05c1a9b fix some compiler whinges 2018-03-13 15:00:16 -07:00
Evan Tschannen 2e741057d4 use references instead of copying regionInfo 2018-03-13 12:59:07 -07:00
Evan Tschannen f6a22c1035 fix: the recovery actor was holding a copy of the tlogInterface after the tlog was removed 2018-03-12 16:56:34 -07:00
Evan Tschannen 72d56a700c fix: do not serialize a tlog interface without a unique id 2018-03-10 09:52:09 -08:00
Evan Tschannen c74211bd92 fix: merge problem 2018-03-09 16:52:37 -08:00
Evan Tschannen 3abf4d7fdf Merge branch 'master' into feature-remote-logs 2018-03-09 14:50:04 -08:00
Evan Tschannen 91bb8faa45 Merge commit 'f773b9460d31d31b7d421860fc647936f31aa1fa'
# Conflicts:
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2018-03-09 14:47:03 -08:00
Evan Tschannen 28ea983487 Merge branch 'release-5.1' into release-5.2
# Conflicts:
#	flow/Trace.cpp
#	versions.target
2018-03-09 14:40:31 -08:00
A.J. Beamon bb9f51bb5c Don't try to extract attributes from the program start trace events if they couldn't be collected. 2018-03-09 11:55:57 -08:00
Evan Tschannen cf6dd1437b suppress spammy trace events 2018-03-09 10:16:34 -08:00
Evan Tschannen ae7d8e90b2 Merge branch 'release-5.1' of github.com:apple/foundationdb into release-5.1 2018-03-09 09:56:09 -08:00
Evan Tschannen 5390af8be4 suppress spammy logs 2018-03-09 09:40:36 -08:00
A.J. Beamon 1bf9f0ec6b Merge pull request #54 from etschannen/release-5.1
fix: new cluster controllers should not consider anything failed unti…
2018-03-09 09:28:21 -08:00
Evan Tschannen f9625f5b2f fix: new cluster controllers should not consider anything failed until they have time to get failure monitoring updates
fix: storage and log class machines wait 100MS before attempting to become the cluster controller
2018-03-08 18:08:41 -08:00
Balachandar Namasivayam e7309a3535 Add trace events to print the ranges in ConsistencyCheck. 2018-03-08 13:53:59 -08:00
Evan Tschannen cf9d02cdbd Merge pull request #48 from apple/release-5.2
Merge release-5.2 into master
2018-03-08 13:21:26 -08:00