Evan Tschannen
8dfda1e57b
fixed another trace event
2018-06-11 12:53:07 -07:00
Evan Tschannen
e28769b98e
fixed trace event name
2018-06-11 12:43:08 -07:00
Evan Tschannen
372ed67497
Merge branch 'master' into feature-remote-logs
# Conflicts:
# fdbserver/DataDistribution.actor.cpp
# fdbserver/MasterProxyServer.actor.cpp
# fdbserver/TLogServer.actor.cpp
# fdbserver/TagPartitionedLogSystem.actor.cpp
2018-06-11 11:34:10 -07:00
Evan Tschannen
588eaf4b36
fix: previous delay 0 could still cause us to recruit a tlog before processing disk errors
2018-06-11 11:26:30 -07:00
Evan Tschannen
64e0260085
fix: assert did not properly handle default constructed policies
2018-06-10 21:51:59 -07:00
Evan Tschannen
b60264024a
fix: we need to copy the txsTag on satellite logs
2018-06-10 20:30:44 -07:00
Evan Tschannen
a5c2a8ee8a
fix: allow disk errors to cancel the actor before recruiting logs
2018-06-10 20:27:19 -07:00
Evan Tschannen
134b5d6f65
fix: only consider data distribution started when remote has recovered so that the quiet database check works correctly
2018-06-10 20:25:15 -07:00
Evan Tschannen
2407e3774b
fix: we cannot run with less storage replication than log replication because it breaks recruitment logic
2018-06-10 20:22:58 -07:00
Evan Tschannen
4903df5ce9
fix: give time to detect failed servers before building teams
2018-06-10 20:21:39 -07:00
Evan Tschannen
0bc7274d0e
fix: hasSatelliteReplication was set incorrectly
2018-06-10 20:20:41 -07:00
Evan Tschannen
6e48d93d39
backed out the healthy team check because it was unnecessary
2018-06-10 12:43:32 -07:00
Evan Tschannen
8a24bf6124
describe did not list all the log sets
2018-06-10 12:38:50 -07:00
A.J. Beamon
f965954122
Merge commit '82be52205b95464e355c449fdf3e7d483fa06677' into trace-log-refactor
# Conflicts:
# fdbserver/Status.actor.cpp
# fdbserver/workloads/DDMetrics.actor.cpp
# flow/Trace.cpp
2018-06-08 16:22:22 -07:00
Evan Tschannen
b9826dc1cb
fix: do not automatically reduce redundancy when we move keys if the database does not have remote replicas. This is to prevent problems when dropping remote replicas from a configuration.
2018-06-08 16:17:27 -07:00
Balachandar Namasivayam
8360f71cbb
Merge branch 'master' of github.com:apple/foundationdb into save-fitness-info
# Conflicts:
# fdbserver/worker.actor.cpp
2018-06-08 16:09:59 -07:00
Balachandar Namasivayam
32285ee958
Don't crash if fitness file is corrupted in real production use case.
2018-06-08 14:03:36 -07:00
A.J. Beamon
99c9958db7
Some more trace event normalization
2018-06-08 13:57:00 -07:00
A.J. Beamon
0ca51989bb
Merge branch 'master' into trace-log-refactor
# Conflicts:
# fdbserver/QuietDatabase.actor.cpp
# fdbserver/Status.actor.cpp
# flow/Trace.cpp
2018-06-08 13:24:30 -07:00
Evan Tschannen
50779a1860
Merge pull request #448 from bnamasivayam/fix-trprofile-test-bug
Having fixed limits for getRange results in continuously getting tran…
2018-06-08 12:52:50 -07:00
Balachandar Namasivayam
34995d4d64
Address review comments.
2018-06-08 11:51:51 -07:00
Balachandar Namasivayam
20febf5ef9
Address review comments.
2018-06-08 11:24:51 -07:00
A.J. Beamon
e5488419cc
Attempt to normalize trace events:
* Detail names now all start with an uppercase character and contain no underscores. Ideally these should be head-first camel case, though that was harder to check.
* Type names have the same rules, except they allow one underscore (to support a usage pattern Context_Type). The first character after the underscore is also uppercase.
* Use seconds instead of milliseconds in details.
Added a check when events are logged in simulation that logs a message to stderr if the first two rules above aren't followed.
This probably doesn't address every instance of the above problems, but all of the events I was able to hit in simulation pass the check.
2018-06-08 11:11:08 -07:00
A.J. Beamon
f1d389448c
Merge pull request #453 from apple/release-5.2
Merge release-5.2 into master
2018-06-08 10:41:44 -07:00
A.J. Beamon
6461478695
Merge pull request #452 from apple/release-5.1
Merge release-5.1 into release-5.2
2018-06-08 10:41:13 -07:00
Evan Tschannen
953c27e570
Merge pull request #431 from ajbeamon/tlog-rename-variables
Rename several variables in TLogServer.actor.cpp to follow our normal camel case conventions.
2018-06-08 10:30:22 -07:00
A.J. Beamon
c9543791fd
Fix case of newSeverity detail in StderrSeverity trace event
2018-06-08 10:24:12 -07:00
Evan Tschannen
7d392689fe
fix: only update metrics for healthy destinations, because unhealthy destinations are already in the source
2018-06-07 18:12:04 -07:00
Evan Tschannen
e4d5817679
fix: we must serve getTeam requests before readyToStart is set, because we cannot complete relocateShard requests without getTeam responses from both team collections
2018-06-07 16:14:40 -07:00
Balachandar Namasivayam
514b0e3c20
Having fixed limits for getRange results in continuously getting transaction_too_old error in some scenarios.
Cutting the limits by half in such cases allows the test to progress.
2018-06-07 15:27:05 -07:00
Evan Tschannen
9f0c16f062
do not build teams which contain failed servers
2018-06-07 14:05:53 -07:00
Balachandar Namasivayam
11b79c6c94
Save fitness info of a process to become a cluster controller. This info is currently lost after a reboot. Save this info and reload it to avoid unnecessary re-recruitments.
2018-06-07 13:07:19 -07:00
Evan Tschannen
b423d73b42
fix: do not finish a shard relocation until all of the storage servers have made the current recovery version durable. This is to prevent dropping a needed storage server as a source for a shard after dropping a remote configuration
2018-06-07 12:29:25 -07:00
Evan Tschannen
f26a2f771d
fix: log router popped one too many versions from messageBlocks
2018-06-05 13:42:48 -07:00
Evan Tschannen
be06938d9d
fix: dropping the remote replication will cause all remote storage servers to die. Make sure we are not restoring redundancy before doing this to prevent data loss in simulation.
2018-06-04 18:46:09 -07:00
Evan Tschannen
6cf9508aae
finished a comment
2018-06-03 19:38:51 -07:00
Evan Tschannen
e95f663ebc
fix: the log router could pop too much data from the logs in rare situations
2018-06-03 19:34:24 -07:00
Evan Tschannen
bf65e745a9
tlogs do not index tags for other localities
2018-06-01 22:51:08 -07:00
Evan Tschannen
c519339adb
avoid peeking from logs that do not match the tag’s locality
2018-06-01 18:42:48 -07:00
Evan Tschannen
ce6a2f0563
Merge pull request #425 from bnamasivayam/leader-election-optimize
Optimize client and server connection times to cluster controller, es…
2018-06-01 18:35:27 -07:00
Balachandar Namasivayam
59bfa74197
Address review comments. Refactor getLeader function to mask the first 7 bits of changeID and return the masked LeaderInfo.
2018-06-01 18:23:24 -07:00
Balachandar Namasivayam
529d0497f1
Proxy going OOM when applying high volumes of writes to a proxy, particularly when the writes arrive suddenly, before ratekeeper can control the workload.
Address this issue by proactively monitoring the memory used by commit batches and dropping requests if a certain memory limit is exceeded.
2018-06-01 15:21:40 -07:00
A.J. Beamon
1f0b519a73
Rename several variables in TLogServer.actor.cpp to follow our normal camel case conventions. I didn't rename every variable here, because some appear to be data structures (like a map) following the pattern keydesc_valuedesc, and I wasn't sure that the straightforward keydescValuedesc rename made sense. I did rename a couple of instances of these where it seemed reasonable, though.
2018-06-01 10:18:07 -07:00
Balachandar Namasivayam
9f55ccd4a5
Remove extraneous comments.
2018-05-31 15:32:47 -07:00
A.J. Beamon
78839b20fd
Merge branch 'master' into trace-log-refactor
# Conflicts:
# flow/Trace.cpp
2018-05-31 10:46:20 -07:00
Balachandar Namasivayam
070366ca70
Optimize client and server connection times to cluster controller, especially in multi DC configurations.
A majority (quorum) answer from coordinators was required to connect to the cluster controller.
Now a cluster controller is optimistically selected to connect even if there is no quorum.
2018-05-30 16:48:04 -07:00
A.J. Beamon
d9c702a9e3
Merge release-5.1 into release-5.2
2018-05-30 09:09:55 -07:00
Evan Tschannen
0e699a3c23
fix: ratekeeper should only control on local logs
2018-05-29 10:51:23 -07:00
A.J. Beamon
026458baf3
Merge release-5.2 into master
2018-05-23 15:32:56 -07:00
A.J. Beamon
e538fb4065
Add error description to error output when networking could not be initialized.
2018-05-23 15:05:28 -07:00
Alec Grieser
40babc40e1
remove one unnecessary line; fix else formatting
2018-05-15 17:20:44 -07:00
Alec Grieser
6d132717f2
add versionstamp compatibility test to VersionStampWorkload
surfaces error found in #387
2018-05-15 17:09:24 -07:00
Dennis Schafroth
a9f54e1865
Compile on macOS 10.13.4: Use ASSERT_ABORT in destructors. Import fstream
2018-05-15 12:55:02 -07:00
A.J. Beamon
02df30149f
Merge branch 'release-5.2' into trace-log-refactor
2018-05-11 11:22:34 -07:00
Evan Tschannen
91338fc984
Merge branch 'master' into feature-remote-logs
2018-05-10 15:33:45 -07:00
Evan Tschannen
8f984cb2c9
Merge branch 'release-5.2'
# Conflicts:
# fdbrpc/TLSConnection.h
2018-05-10 09:13:22 -07:00
Evan Tschannen
d3450ce5b0
Merge pull request #343 from bnamasivayam/tls-plugin
Tls plugin
2018-05-09 16:35:53 -07:00
Evan Tschannen
f6e55d0b74
Merge pull request #348 from etschannen/release-5.2
DR upgrade tests now test the durability of the data.
2018-05-09 15:40:03 -07:00
Evan Tschannen
8930c2e3db
DR upgrade tests now test the durability of the data.
2018-05-09 15:11:05 -07:00
Balachandar Namasivayam
7591931a09
Revert "Make tls_verify_peers as a comma separated string of constraints."
This reverts commit 2033847e4b.
2018-05-09 14:40:36 -07:00
Balachandar Namasivayam
2033847e4b
Make tls_verify_peers as a comma separated string of constraints.
2018-05-09 14:37:39 -07:00
Alec Grieser
f3093642b3
Merge pull request #242 from alecgrieser/32437306-better-versionstamped-value
Unify SET_VERSIONSTAMPED_KEY and SET_VERSIONSTAMPED_VALUE API
2018-05-09 09:04:07 -07:00
Balachandar Namasivayam
e8b7f4b190
Add password support for tls.
2018-05-08 20:46:31 -07:00
Balachandar Namasivayam
49af5d685b
Restore previous behavior of not specifying peer_verify option means disable checking.
2018-05-08 18:54:44 -07:00
Balachandar Namasivayam
d3b5cfb93c
Support latest TLS plugin.
Add support for https in backup.
2018-05-08 16:28:13 -07:00
A.J. Beamon
54b4c9e061
Merge branch 'release-5.2' into trace-log-refactor
# Conflicts:
# fdbserver/Status.actor.cpp
2018-05-08 15:51:54 -07:00
Evan Tschannen
9f0d244efe
Merge branch 'master' into feature-remote-logs
2018-05-08 13:28:23 -07:00
Evan Tschannen
7acdc314e4
Merge branch 'release-5.2'
# Conflicts:
# fdbrpc/TLSConnection.actor.cpp
2018-05-08 13:22:53 -07:00
Evan Tschannen
1f6c6a886b
Merge branch 'release-5.1' into release-5.2
2018-05-08 13:08:11 -07:00
A.J. Beamon
ca720e1540
Merge pull request #297 from apple/release-5.2
Merge 5.2 to Master
2018-05-08 12:04:20 -07:00
Alec Grieser
47c9e4f923
update bindings and bindingtester that uses versionstamps to use new protocol
issue #148
2018-05-08 08:57:09 -07:00
Alec Grieser
464e2cdbf0
change SetVersionstampedKey and SetVersionstampedValue behavior based on API version to make them consistent
2018-05-08 08:57:09 -07:00
Alec Grieser
14cca75429
server components of version of alternative versionstamp op that writes to an arbitrary place in the value
2018-05-08 08:57:08 -07:00
Evan Tschannen
e8f6ad88f0
fix: tripled the smallStorageTarget to prevent simulations which do a lot of work from timing out
2018-05-07 17:26:44 -07:00
Alec Grieser
752deb07a1
fix fdbmonitor help message output; fix spelling error in Ratekeeper.actor.cpp
2018-05-07 16:19:50 -07:00
Evan Tschannen
4677789b38
fix: low latency tests need 4 machines per datacenter to support triple replication after 1 machine has failed
2018-05-07 11:28:25 -07:00
Evan Tschannen
529bd34cf9
fix: when a tlog is stopped by another recruitment, it no longer has the opportunity for committingQueue to be set
2018-05-06 20:37:44 -07:00
Evan Tschannen
81c7bddaf8
fix: must check for log router errors while waiting on satellite replies because the recruitmentID will not be updated if it threw an error
2018-05-06 18:15:12 -07:00
Evan Tschannen
8cb8198250
fix: the e-brake should be buggified with ratekeeper storage limits to prevent simulation from running full blast into the e-brake resulting in simulation taking forever to complete (joshua timeouts)
2018-05-06 12:33:25 -07:00
Evan Tschannen
cc6511a39e
fix: we do not know that the minimum popped version on the log router is a known committed version until it has advanced.
2018-05-06 09:32:41 -07:00
Evan Tschannen
b1935f1738
fix: do not allow a storage server to be removed within 5 million versions of it being added, because if a storage server is added and removed between the known committed version and the recovery version, the storage server will need to see either the add or the remove when it peeks
2018-05-05 18:16:28 -07:00
Evan Tschannen
8371afb565
fix: log routers need to know if the log system is stopped to determine how they should peek the last log generation
2018-05-05 17:56:00 -07:00
Evan Tschannen
7ed64c821e
fix: recruiting a cluster controller takes longer after restarting tests because we wait until files have recovered from disk before starting
2018-05-05 17:20:48 -07:00
Evan Tschannen
e8ea02e054
fix: storage servers need to fail if they can no longer peek data
2018-05-05 17:19:59 -07:00
A.J. Beamon
432a295bc2
Add read bytes and read keys info to status. Collect this information directly from StorageMetrics rather than through ratekeeper.
2018-05-04 12:01:40 -07:00
A.J. Beamon
ce0c991e78
Refactor trace events to store a vector of fields that aren't encoded until write time. Better support for pre-network trace events. Rework how trace events are queried. Some initial work towards pluggable formatting of logs.
2018-05-02 10:44:38 -07:00
Evan Tschannen
440e2ae609
fix: data distribution logic was incorrect for finding a complete source team in a failed DC
2018-05-01 23:08:31 -07:00
Evan Tschannen
87ad03ce53
locality aware load balancing was disabled on the storage servers because emergency teams might cause a server to be assigned a shard when it does not actually have the data. This problem has been fixed, so we can re-enable locality aware load balancing.
2018-05-01 22:45:22 -07:00
Evan Tschannen
b4bd03e67e
fix: we cannot set queueCommitEnd until we have popped the log system to prevent the popped version from going backwards
2018-05-01 22:20:25 -07:00
Evan Tschannen
12ef63b698
knobify replace contents bytes
2018-05-01 19:43:35 -07:00
Evan Tschannen
656a817e74
fix: only reconfigure during the quiet database check, because excluding at the same time as reconfiguring causes the master to indefinitely restart recovery
2018-05-01 15:31:49 -07:00
Evan Tschannen
c3f2e2bb38
fix: do not attempt to become the cluster controller before recovering files from disk
2018-05-01 12:05:43 -07:00
Evan Tschannen
e27531d39e
Merge branch 'master' into feature-remote-logs
2018-04-30 22:55:46 -07:00
Evan Tschannen
10d25927cd
Merge branch 'master' into feature-remote-logs
...
# Conflicts:
# fdbserver/DataDistribution.actor.cpp
2018-04-30 22:15:39 -07:00
Evan Tschannen
eded5631e6
fix: epoch end was already known committed version + 1, and did not need an additional + 1.
2018-04-30 22:03:11 -07:00
Evan Tschannen
e1e43cff28
endEpoch implemented using getDurableVersion
2018-04-30 18:32:04 -07:00
Alex Miller
bc8e6acbe8
Fix the other half of simulation requiring a TLS Plugin.
This commit:
1. Restores --tls_plugin as a way to provide the path to the TLS plugin when running in simulation.
2. Removes the TLS Plugin as being required for 5% of tests.
3. Standardizes on 'sslEnabled' as a variable name.
And is a fix/improvement upon commit f7733d1b.
(1) previously didn't work, because we would create multiple new TLSOptions
instances and run init_plugin multiple times. Only the first call would use
the argument specified on the command line. To fix this, the TLSOptions
derived from the command line is threaded through all the simulation code that
needs it.
(2) was an oversight in f7733d1b, which didn't actually make "should we be TLS"
dependent on whether the TLS plugin was available or not.
(3) is just nice for trying to grep around in the codebase.
2018-04-30 18:26:29 -07:00
Evan Tschannen
5143871fed
passed debug ids into all versions of peek() to assist debugging
2018-04-30 13:36:35 -07:00
Evan Tschannen
883f2318a0
test fearless configurations
2018-04-30 13:17:29 -07:00
Evan Tschannen
99598d180b
fix: the log router must be initialized with all expected tags to prevent mistakenly choosing a minPopped that is too high
2018-04-30 10:58:41 -07:00
Evan Tschannen
92b134eb98
fix: errors from removed were not handled properly
2018-04-29 23:05:08 -07:00
Evan Tschannen
6f318dbff2
fix: do not reply to recruitment until we are sure the log commits to the queue
2018-04-29 22:08:24 -07:00
Evan Tschannen
9cdabfed0e
added useful trace events
2018-04-29 18:54:47 -07:00
Evan Tschannen
2e286b768d
fix: locality is needed for a logSet to call getPushLocations
fix: accidentally deleted allowPops assignment on the log router
2018-04-29 13:47:32 -07:00
Evan Tschannen
dbdeeaa5cf
fix: log routers are given all the information they need to add remote tags in their initialization request
2018-04-28 18:04:57 -07:00
Alec Grieser
69e831d522
Merge remote-tracking branch 'upstream/release-5.2' into merge-release-5.2
2018-04-28 17:44:52 -07:00
Evan Tschannen
33fa8f2cac
fix: make sure log routers only add remote tags from the correct log set
2018-04-28 15:04:13 -07:00
Evan Tschannen
f77c1ec14e
fix: fixed rare bug where a log stopped by a different recruitment would still respond successfully to the recruitment message
2018-04-28 13:34:06 -07:00
Evan Tschannen
23c0249d80
fix: old log routers' tags must be available at the best location in the new generation
2018-04-28 11:13:10 -07:00
Alec Grieser
a1faaafca3
Merge remote-tracking branch 'upstream/release-5.1' into merge-release-5.1
2018-04-27 16:38:18 -07:00
Yichi Chiang
c721ab6854
Fix review comments
2018-04-27 13:54:34 -07:00
Evan Tschannen
af63dac5dd
fix: remote logs need to wait until the durable known committed version is greater than the recovery version before completing recovery to ensure we will not pick a start version that we do not have
2018-04-27 12:18:42 -07:00
Evan Tschannen
32e9ea3bb4
fix: recruited the wrong number of log routers
2018-04-26 22:22:15 -07:00
Evan Tschannen
d72087bfd3
fix: we may not be able to recruit enough log routers, in this case put multiple log routers on the same worker, but also properly rank this configuration lower in better master exists
2018-04-26 22:18:07 -07:00
Evan Tschannen
a12b994966
fix: log routers need tlogs to be present before accepting data
2018-04-26 18:37:51 -07:00
Evan Tschannen
abcfb0604a
fix: cloneNoMore needs to pass useBestSet
2018-04-26 18:32:12 -07:00
Yichi Chiang
6bddf8aefa
Upgrade DR from 5.1 to 5.2
2018-04-26 17:24:40 -07:00
Evan Tschannen
c7fd85243b
fix: passed the wrong argument value
2018-04-26 13:25:27 -07:00
Evan Tschannen
0dd6931223
fix: remote recruitment must still wait for old log routers to be recruited since they are not needed by the newly recruited logs to finish recovery
2018-04-26 12:55:28 -07:00
Evan Tschannen
721aaa2a6b
fix: we need to monitor old log routers for failures before recovery is complete
fix: after configuring out of fearless, remote logs will not have all the data until the new configuration
2018-04-26 10:59:21 -07:00
Evan Tschannen
a2b62e15ea
fix: only peek to peekEnd()
2018-04-25 19:56:50 -07:00
Evan Tschannen
7e434348ce
fix: storage servers did not properly pull data when configuring from a fearless setup to a non-fearless setup
2018-04-25 18:20:28 -07:00
Evan Tschannen
fa9089c2e8
fix: removed storage servers must be popped on remote logs from the proxy
2018-04-25 15:38:34 -07:00
Evan Tschannen
471e7b9ab9
fix: update the logSystem on the proxies so that they can pop the txs tag from remote logs
2018-04-25 10:16:31 -07:00
Evan Tschannen
4119a1c5d5
do not add cursors for log sets that have no data
2018-04-24 22:06:10 -07:00
Evan Tschannen
95855dbfc4
correctly filter locality data
2018-04-24 18:14:34 -07:00
Alex Miller
f7733d1bd0
Do not require the TLS Plugin for simulation.
It appears that explicit calls to TLS-related things had snuck in over time,
which meant that simulation runs that weren't even configured to use SSL still
wanted and required the TLS plugin.
This commit instead threads through the understanding of if any TLS-related
options were provided, and if not, then don't call anything TLS-related so that
we don't require the TLS plugin.
Hopefully this makes life easier for the opensource folk. :)
2018-04-24 16:53:30 -07:00
Evan Tschannen
35b2ca820a
fix: certain tlog errors during remote recovery could fail to kill the master; the master could have a reference counting cycle with its actor collection
2018-04-24 16:10:14 -07:00
Evan Tschannen
1cfe1cb7f0
fix: do not let the storage server process an exhausted version, because it could prevent a rollback
2018-04-23 22:03:55 -07:00
Evan Tschannen
ae1de575f1
fix: remote logs are not considered fully recovered until they are at recoveredAt
2018-04-23 17:49:46 -07:00
Evan Tschannen
3ec09ce9f6
fix: only peekSingle needs to throw worker_removed, because tlogs have other ways to get notified they are no longer needed
fix: we need to wait until tags are popped past recoveredAt instead of unrecovered before
2018-04-23 16:43:08 -07:00
Evan Tschannen
126fc53d10
fix: the start version for peek cursors that merge with multiple log sets is the maximum of the individual start versions
2018-04-23 12:42:51 -07:00
tracebundy
dd36f55a90
Update fdbserver.actor.cpp
fix the bug 'fdbserver/fdbserver.actor.cpp:761:16: error: aggregate ‘std::ifstream ifs’ has incomplete type and cannot be defined'
2018-04-23 10:06:15 -07:00
Dennis Schafroth
290122637b
Using ASSERT_ABORT in destructors
2018-04-23 14:05:10 +02:00
Evan Tschannen
73597f190e
fix: new tlogs are initialized with exactly the tags which existed at the recovery version
2018-04-22 20:28:01 -07:00
Evan Tschannen
a520d03397
fix: if we cannot find a tag, it must have been popped at the recovery version.
2018-04-22 15:08:38 -07:00
Evan Tschannen
ef23136809
fix: ensure the logSystemConfig is updated with newly recruited log routers
2018-04-22 11:54:39 -07:00
Evan Tschannen
fceec020de
fix: use the known committed version if the last generation primary logs were in the same data center as this generation
the known committed version in end epoch is the maximum seen in all responses regardless of log set
2018-04-22 11:14:13 -07:00
Evan Tschannen
c3a344d44e
fix: do not choose a remote start version past the start of the locked logs
2018-04-21 16:03:28 -07:00
Evan Tschannen
28a1fa9dc2
fix: we need to notify the old log system that its recruitmentID has changed
2018-04-21 12:57:00 -07:00
Evan Tschannen
1d1e2cd367
fix: initialize the known committed version on the tlog
2018-04-21 00:41:15 -07:00
Evan Tschannen
8d350ceb5f
fix: persist the known committed version on the tlogs
2018-04-20 17:55:46 -07:00
Evan Tschannen
a6d9e889f0
a cleaner solution to preventing tlogs from peeking log routers
2018-04-20 13:25:22 -07:00
Evan Tschannen
f5c3417905
fix: prevent tlogs from peeking the wrong log routers
2018-04-20 00:30:37 -07:00
Evan Tschannen
5da452db8e
fix: pop the log routers again after the log system updates
2018-04-19 14:33:31 -07:00
Bruce Mitchener
2f8a0240f1
Fix some typos.
2018-04-19 11:44:01 -07:00
Bruce Mitchener
9cdf25eda3
Fix some typos.
2018-04-20 00:49:22 +07:00
Evan Tschannen
d46d5487bd
Merge branch 'release-5.2'
2018-04-18 20:46:03 -07:00
Evan Tschannen
57d650062a
merge 5.1 into 5.2
2018-04-18 20:44:31 -07:00
Evan Tschannen
224621be04
fix: extraDB==0 must leave g_simulator.extraDB as null, so that non-DR tests do not attempt to use a DR database
2018-04-18 19:34:35 -07:00
Evan Tschannen
22526ef996
fix: do not tell storage servers about large sections of empty versions, because it can lead them to make mutations durable which have not been committed
2018-04-18 16:06:44 -07:00
Evan Tschannen
447c7bd15b
fix: log routers use durable known committed version at the time of the pop to determine what is safe to pop from their logs
fix: storage server does not advance its version across large version increase until it has data associated with the version
2018-04-18 12:07:29 -07:00
Evan Tschannen
e43fb6d8bc
fix: the log routers were popping too many versions because the known committed version is less than minPopped version
2018-04-17 19:41:36 -07:00
Evan Tschannen
c1ccc8522c
Merge branch 'release-5.2'
2018-04-17 18:38:12 -07:00
Evan Tschannen
db98c1b9b6
Merge branch 'release-5.1' into release-5.2
# Conflicts:
# versions.target
2018-04-17 18:36:19 -07:00
Evan Tschannen
8569a85771
fix: only let a log router pop if the tlog it is serving is fully recovered
2018-04-17 15:03:22 -07:00
Evan Tschannen
760bc8bc99
fix: log router version needs to be fetched before it is available
fix: tlog did not fetch known committed version if start version was exactly equal to it
2018-04-17 11:16:48 -07:00
Evan Tschannen
093908b83f
fix: log routers were starting one version too late
2018-04-17 00:29:16 -07:00
Evan Tschannen
3e40505f4a
Revert "fix: remote logs should reply until they have recovered through recoverAt"
This reverts commit 3c0c03c004.
2018-04-16 23:17:16 -07:00
Evan Tschannen
3c0c03c004
fix: remote logs should reply until they have recovered through recoverAt
2018-04-16 17:25:49 -07:00
Evan Tschannen
cef6c9b418
fix: the startVersion cannot be larger than the known committed version
2018-04-16 16:21:27 -07:00
Evan Tschannen
dcfa1847ff
fix: log router’s starting popped version must be less than its starting version
2018-04-16 11:43:03 -07:00
Evan Tschannen
3018a7b1b3
fix: the known committed version of a newly initialized log is 1, since by definition the first commit must have succeeded
2018-04-16 10:42:48 -07:00
Evan Tschannen
a8662f8737
fix: remote recovered does not need to wait for old logs to be removed
2018-04-16 10:14:39 -07:00
Evan Tschannen
e53f17a83a
fix: the newest log router needs to start where the last old one ends
2018-04-15 14:54:22 -07:00
Evan Tschannen
5533016f1e
fix: tlogs are now initialized immediately, instead of when starting the core, this must be done to pop the log routers during recovery
fix: log router start version must be the same as remote log start version
2018-04-15 14:33:07 -07:00
Evan Tschannen
0496bee1ef
fix: suppress expected errors in data distribution
2018-04-15 11:30:22 -07:00
Evan Tschannen
041f5787fb
fix: peekLocal does not stop when a locality does not exist
fix: lock logs only stops on special or upgraded locality
fix: recruiting old log routers respects the passed in startVersion
2018-04-14 19:06:24 -07:00
Evan Tschannen
f5141acae9
fix: log routers need all logs present in their log system since they call addRemoteTags
2018-04-13 17:33:36 -07:00
Evan Tschannen
65e69620a7
fix: unrecoveredBefore on a new log is at minimum 1
2018-04-13 10:41:30 -07:00
Yichi Chiang
a4e8b6492c
Fix DR Upgrade workload backup range
2018-04-13 09:59:32 -07:00
Evan Tschannen
c589630e53
fix: log router start version is based on the start version of the local logs
2018-04-12 18:14:23 -07:00
Evan Tschannen
3b7e4410cf
fix: protect against peeking a version too early from a log router
2018-04-12 16:15:17 -07:00
Evan Tschannen
1af5ac0d9d
fix: a number of different problems prevented tlogs from using log routers during recovery
2018-04-12 15:20:54 -07:00
Evan Tschannen
c6229e443c
fix: do not use resolution class when using regions
2018-04-11 21:22:53 -07:00
Evan Tschannen
4248fbec61
fix: must set startVersion when upgrading
2018-04-11 17:33:17 -07:00
Evan Tschannen
19762b847d
Merge branch 'release-5.2'
# Conflicts:
# fdbserver/DatabaseConfiguration.cpp
# fdbserver/SimulatedCluster.actor.cpp
2018-04-10 17:02:43 -07:00
Evan Tschannen
c1ba16b3c8
Merge branch 'release-5.1' into release-5.2
# Conflicts:
# bindings/java/src/test/com/apple/foundationdb/test/AbstractTester.java
# bindings/java/src/test/com/apple/foundationdb/test/VersionstampSmokeTest.java
# bindings/nodejs/lib/fdb.js
# bindings/nodejs/src/Version.h
# bindings/nodejs/tests/tuple_test.js
2018-04-10 16:50:47 -07:00
Evan Tschannen
b0a88001cc
Merge pull request #132 from yichic/support-dr-upgrade-test
Support DR upgrade test
2018-04-10 16:30:19 -07:00
Evan Tschannen
b46c32535c
suppressed spammy trace events
2018-04-10 15:52:32 -07:00
Yichi Chiang
d0230d4d13
Support DR upgrade test in 5.1
2018-04-10 15:19:53 -07:00
Alex Miller
b289312a37
Merge pull request #120 from alecgrieser/storage-class-help-text
Add router to help text for storage class of fdbserver
2018-04-10 15:01:27 -07:00
Evan Tschannen
3453a51d0f
remoteRecovery was still swallowing errors
2018-04-10 13:31:24 -07:00
Evan Tschannen
5fcedd2e98
fix: coordinated state errors were being eaten
2018-04-10 11:14:57 -07:00
Evan Tschannen
2ab2c788b3
fix: the start version is allowed to be larger than the recovery version
2018-04-09 21:58:14 -07:00
Evan Tschannen
a738c4bec1
fix: if the known committed version is equal to the recovery version we do not need to copy any data
2018-04-09 20:48:55 -07:00
Evan Tschannen
419951f601
fix: need to initialize tlog versions to less than the startVersion
2018-04-09 17:17:11 -07:00
Evan Tschannen
27e14790b1
fix: do not start at a version larger than the recovery version
2018-04-09 15:08:01 -07:00
Evan Tschannen
7566a0d109
fix: endEpoch gets its logs from the core state, so by definition they are written
2018-04-09 11:44:54 -07:00
Evan Tschannen
4c89f721cd
fix: do not include logRouter tags in lock results
2018-04-09 10:48:57 -07:00
Evan Tschannen
7af892f50b
first working version of non-copying recovery working with fearless configurations
2018-04-08 21:24:05 -07:00
Alex Miller
0136a01c18
Fix "Not enough physical servers available" error due to incorrect server calculation.
2018-04-05 15:13:21 -07:00
Evan Tschannen
bc938d9273
fix: storage recruitment could get stuck in a spin loop
2018-04-03 18:06:31 -07:00
Evan Tschannen
331e707684
fix: pop all tags that did not have data at the recovery version because fully popped tags may come back when pullAsyncData re-indexes the mutations
2018-03-31 16:47:56 -07:00
Evan Tschannen
96fffe2cea
fix: do not update version if the log has been stopped
2018-03-30 22:11:42 -07:00
Evan Tschannen
4fb2b99341
fix: using only one region still means we need 3 machines per datacenter, the other machines in the other datacenters just won’t be used
2018-03-30 19:26:22 -07:00
Evan Tschannen
579ba58930
pop old tags only looks at recovered tags, and checks if they are still being used
2018-03-30 19:08:01 -07:00
Evan Tschannen
8352b93f48
fix: do not reuse tags that are still in historyTags, pop historyTags past epochEnd to allow tlogs to finish recovery
fix: peekLocal did not properly respect end
fix: the storage server added to the end of the history vector instead of the beginning
2018-03-30 17:39:45 -07:00
Evan Tschannen
43cb63df25
fix: the collectTags bool was set incorrectly
2018-03-29 18:19:29 -07:00
Evan Tschannen
1a4ded1c99
support upgrades by merging tags associated with the different peek requests
2018-03-29 17:54:08 -07:00
Evan Tschannen
b36e08f08f
first version of non-copying recovery. Upgrades are broken, and it has not been tested using fearless configurations yet
2018-03-29 15:12:38 -07:00
Evan Tschannen
da737e1ea3
suppress the BestTeamStuck trace event
2018-03-26 18:32:32 -07:00
Evan Tschannen
82ed956c65
renamed the multi_dc configuration to three_datacenter. The old three_datacenter configuration was not a useful configuration.
2018-03-26 18:31:26 -07:00
Evan Tschannen
b95e68eb5a
fix: getDatabaseSize is really inefficient and causes slow tasks in the real world. Outside of simulation just assume the database is really large, because we only need the InvalidShardSize check in simulation
2018-03-26 17:35:11 -07:00
Alec Grieser
bb5f3ebb6d
add router to help text for storage class of fdbserver
2018-03-26 13:26:56 -07:00
Evan Tschannen
d3fb17d30a
Merge pull request #74 from bnamasivayam/client-profiling-tests
...
Client profiling tests - Part 1
2018-03-23 16:52:49 -07:00
Balachandar Namasivayam
1e719d79e9
Remove incorrect ASSERTs
...
Account for corner cases in missing chunks.
2018-03-23 15:51:56 -07:00
Evan Tschannen
5db52ab081
Merge pull request #87 from etschannen/feature-remote-logs
...
Feature remote logs
2018-03-23 12:55:17 -07:00
Evan Tschannen
7c48e1d31c
Update SimulatedCluster.actor.cpp
2018-03-23 12:54:44 -07:00
A.J. Beamon
ddc0c613ed
Merge pull request #109 from apple/release-5.2
...
Merge Release 5.2 into master
2018-03-21 09:37:56 -07:00
Clement Pang
64deb0e0a1
Address review comments.
2018-03-20 14:38:04 -07:00
Clement Pang
b46ffb4cbc
Available space should take into account both memory and disk
2018-03-20 14:38:04 -07:00
Evan Tschannen
0746fe4d56
optimized tag lookups on the tlog by removing one level of vectors
2018-03-20 10:41:42 -07:00
Evan Tschannen
d8e064d8bb
fix: when a new log is recruited on a shared log, all outstanding commits need to be notified that they are stopped, because there is no longer a guarantee that their queueCommittedVersion will advance
2018-03-19 17:48:28 -07:00
Alec Grieser
551ea9c7f8
Merge remote-tracking branch 'upstream/release-5.2' into master-release-5.2-merge
2018-03-19 12:34:50 -07:00
yichic
ede5cab192
Merge pull request #89 from yichic/share-log-mutations-5.2
...
Share log mutations 5.2
2018-03-19 12:01:26 -07:00
Yichi Chiang
1f2602d2b3
Fix all review comments
2018-03-19 11:33:33 -07:00
Yichi Chiang
d6559b144f
Share log mutations between backups and DRs which have the same backup range
2018-03-19 11:32:50 -07:00
Evan Tschannen
54be14000d
do not deserialize tags
2018-03-17 11:24:18 -07:00
Evan Tschannen
4dcef08260
optimized the log router to use a vector instead of a map for tag data
2018-03-17 11:08:37 -07:00
Evan Tschannen
9c8cb445d6
optimized the tlog to use a vector for tags instead of a map
2018-03-17 10:36:19 -07:00
Evan Tschannen
fecfea0f7d
fix: messages vector was not cleared
2018-03-17 10:24:44 -07:00
Balachandar Namasivayam
9e3e3c8561
Add some sanity checks to deserialized data.
2018-03-16 18:45:25 -07:00
Yichi Chiang
f12c1d811c
Fix all review comments
2018-03-16 18:09:23 -07:00
Yichi Chiang
26b93ff920
Share log mutations between backups and DRs which have the same backup range
2018-03-16 18:09:23 -07:00
Evan Tschannen
ccd70fd005
The tlog uses the tags embedded in the message instead of a separate vector of locations
...
optimized remote tlog committing to avoid re-serializing the message
2018-03-16 16:47:05 -07:00
Evan Tschannen
820382ea68
optimized the log router commit path to avoid re-serializing the data
2018-03-16 11:40:21 -07:00
Evan Tschannen
a42205eb8e
test running with only one region
2018-03-15 15:40:58 -07:00
Balachandar Namasivayam
89d7cc1093
Minor Bug fixes...
2018-03-15 11:00:47 -07:00
Evan Tschannen
82fb6424ec
fix: storage recruitment could get stuck in a spin loop
2018-03-15 11:00:44 -07:00
Evan Tschannen
65b532658f
added support for single region configurations
2018-03-15 10:59:30 -07:00
Alec Grieser
0853fcb052
switch to using %zu for some size_t variables in printf
2018-03-14 18:07:05 -07:00
Evan Tschannen
59723f51f8
fix: continue to attempt to lock logs until remote logs are recovered, this is so that remote logs get locked and readers know they will not have any more data
...
do not throttle trace events in simulation
2018-03-14 12:39:55 -07:00
Balachandar Namasivayam
856d2a0a9d
Add correctness tests for Client transaction profiling data format. It also includes format check across upgrades.
2018-03-14 12:39:50 -07:00
Alec Grieser
70a05c1a9b
fix some compiler whinges
2018-03-13 15:00:16 -07:00
Evan Tschannen
2e741057d4
use references instead of copying regionInfo
2018-03-13 12:59:07 -07:00
Evan Tschannen
f6a22c1035
fix: the recovery actor was holding a copy of the tlogInterface after the tlog was removed
2018-03-12 16:56:34 -07:00
Evan Tschannen
72d56a700c
fix: do not serialize a tlog interface without a unique id
2018-03-10 09:52:09 -08:00
Evan Tschannen
c74211bd92
fix: merge problem
2018-03-09 16:52:37 -08:00
Evan Tschannen
3abf4d7fdf
Merge branch 'master' into feature-remote-logs
2018-03-09 14:50:04 -08:00
Evan Tschannen
91bb8faa45
Merge commit 'f773b9460d31d31b7d421860fc647936f31aa1fa'
...
# Conflicts:
# tests/fast/SidebandWithStatus.txt
# tests/rare/LargeApiCorrectnessStatus.txt
# tests/slow/DDBalanceAndRemoveStatus.txt
2018-03-09 14:47:03 -08:00
Evan Tschannen
28ea983487
Merge branch 'release-5.1' into release-5.2
...
# Conflicts:
# flow/Trace.cpp
# versions.target
2018-03-09 14:40:31 -08:00
A.J. Beamon
bb9f51bb5c
Don't try to extract attributes from the program start trace events if they couldn't be collected.
2018-03-09 11:55:57 -08:00
Evan Tschannen
cf6dd1437b
suppress spammy trace events
2018-03-09 10:16:34 -08:00
Evan Tschannen
ae7d8e90b2
Merge branch 'release-5.1' of github.com:apple/foundationdb into release-5.1
2018-03-09 09:56:09 -08:00
Evan Tschannen
5390af8be4
suppress spammy logs
2018-03-09 09:40:36 -08:00
A.J. Beamon
1bf9f0ec6b
Merge pull request #54 from etschannen/release-5.1
...
fix: new cluster controllers should not consider anything failed unti…
2018-03-09 09:28:21 -08:00
Evan Tschannen
f9625f5b2f
fix: new cluster controllers should not consider anything failed until they have time to get failure monitoring updates
...
fix: storage and log class machines wait 100 ms before attempting to become the cluster controller
2018-03-08 18:08:41 -08:00
Balachandar Namasivayam
e7309a3535
Add trace events to print the ranges in ConsistencyCheck.
2018-03-08 13:53:59 -08:00
Evan Tschannen
cf9d02cdbd
Merge pull request #48 from apple/release-5.2
...
Merge release-5.2 into master
2018-03-08 13:21:26 -08:00