Commit Graph

587 Commits

Author SHA1 Message Date
Evan Tschannen 3abf4d7fdf Merge branch 'master' into feature-remote-logs 2018-03-09 14:50:04 -08:00
Evan Tschannen 91bb8faa45 Merge commit 'f773b9460d31d31b7d421860fc647936f31aa1fa'
# Conflicts:
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2018-03-09 14:47:03 -08:00
Evan Tschannen 28ea983487 Merge branch 'release-5.1' into release-5.2
# Conflicts:
#	flow/Trace.cpp
#	versions.target
2018-03-09 14:40:31 -08:00
A.J. Beamon bb9f51bb5c Don't try to extract attributes from the program start trace events if they couldn't be collected. 2018-03-09 11:55:57 -08:00
Evan Tschannen cf6dd1437b suppress spammy trace events 2018-03-09 10:16:34 -08:00
Evan Tschannen ae7d8e90b2 Merge branch 'release-5.1' of github.com:apple/foundationdb into release-5.1 2018-03-09 09:56:09 -08:00
Evan Tschannen 5390af8be4 suppress spammy logs 2018-03-09 09:40:36 -08:00
A.J. Beamon 1bf9f0ec6b
Merge pull request #54 from etschannen/release-5.1
fix: new cluster controllers should not consider anything failed unti…
2018-03-09 09:28:21 -08:00
Evan Tschannen f9625f5b2f fix: new cluster controllers should not consider anything failed until they have time to get failure monitoring updates
fix: storage and log class machines wait 100MS before attempting to become the cluster controller
2018-03-08 18:08:41 -08:00
Balachandar Namasivayam e7309a3535 Add trace events to print the ranges in ConsistencyCheck. 2018-03-08 13:53:59 -08:00
Evan Tschannen cf9d02cdbd
Merge pull request #48 from apple/release-5.2
Merge release-5.2 into master
2018-03-08 13:21:26 -08:00
A.J. Beamon 2c92ef8ff8
Merge pull request #47 from apple/release-5.1
Merge Release 5.1 into Release 5.2
2018-03-08 13:18:45 -08:00
A.J. Beamon 73cec8abad Merge branch 'release-5.1' of github.com:apple/foundationdb into release-5.1 2018-03-08 11:47:44 -08:00
Balachandar Namasivayam 4f58bca66a Simple refactor of code... 2018-03-08 11:34:25 -08:00
Balachandar Namasivayam 1c1a497ea2 Refactor getKeyServers to be more readable.
Fix possible memory corruption by returning KeyRange instead of KeyRangeRef in getKeyServers.
Simplify getMasterProxies on DatabaseContext class.
2018-03-08 11:34:18 -08:00
Balachandar Namasivayam 03a40354e3 Having 1000 as the limit for GetKeyServerLocationsRequest sometimes generates large packet warnings. Reduce it to 100.
Fix the bug where some of the key server shards may not be fetched.
2018-03-08 11:34:11 -08:00
A.J. Beamon fdcaf473ae Don't pass a copy of the StorageServerInterface to storageServerRollbackRebooter. This prevents a situation where the storage server has terminated but the request streams are left open until the underlying KV-store gets closed. 2018-03-08 11:14:24 -08:00
Evan Tschannen fa7eaea7cf fix: shards affected by team failure did not properly handle separate teams for the remote and primary data centers 2018-03-08 10:50:05 -08:00
bnamasivayam f838bc077e
Merge pull request #36 from ajbeamon/release-5.2
Set the address in consistency check processes…
2018-03-07 15:00:14 -08:00
Evan Tschannen 9d4cdc828b fix: inactive cursors are still useful if their version is larger than the current version 2018-03-07 12:54:53 -08:00
Evan Tschannen 68606c7984 fix: sim2 logic for when a kill is safe was incorrect 2018-03-06 18:38:05 -08:00
Alec Grieser 2a2ac56529
Merge pull request #22 from alecgrieser/37844532-expose-append-if-fits
Expose APPEND_IF_FITS to clients
2018-03-06 16:31:36 -08:00
Evan Tschannen 8c88041608 fix: we must commit to the number of log routers we are going to use when recruiting the primary, because it determines the number of log router tags that will be attached to mutations 2018-03-06 16:31:21 -08:00
A.J. Beamon 232bd496bf Set the address in consistency check processes in the same way we set it for clients so that it shows up in trace logs. Disallow specifying a public address for consistency check processes. 2018-03-06 15:40:04 -08:00
A.J. Beamon 7f8f655b9c Revert "Fix build errors"
This reverts commit 51804f0504.
2018-03-06 10:28:39 -08:00
A.J. Beamon f2c804e14f Reverting changes from merge of master into release-5.2 (b25810711c). Note that we never intend to release master into release-5.2, but if we did we would need to revert this commit. 2018-03-06 10:15:04 -08:00
Evan Tschannen 1194e3a361 added region-based configuration to support a large variety of fearless setups. Currently only 1 primary 1 remote setups are allowed. 2018-03-05 19:27:46 -08:00
Balachandar Namasivayam aea1f7ba21 Add tests for Client Transaction Profiling correctness 2018-03-05 18:55:23 -08:00
Balachandar Namasivayam 51804f0504 Fix build errors 2018-03-05 15:18:14 -08:00
A.J. Beamon b25810711c
Merge branch 'master' into release-5.2 2018-03-05 10:32:57 -08:00
Balachandar Namasivayam 8ae640c062 Addressed review comments. 2018-03-02 17:56:49 -08:00
Alec Grieser 218b7a41e2 add APPEND_IF_FITS to workload and remove guard ; add command to vexillographer 2018-03-02 17:43:39 -08:00
Balachandar Namasivayam 11df1aeabf Add new api to get shared tlogs id and address 2018-03-02 16:50:30 -08:00
Evan Tschannen 470f5c01f3 changed remoteDcId to a vector of ids, to support future configurations where there are multiple remote databases 2018-02-26 17:09:09 -08:00
Evan Tschannen a67296b373 do not test fearless configurations to merge with master 2018-02-26 13:31:06 -08:00
Evan Tschannen 8e966fdf9c simulated cluster tests all configurations. Still needs to randomize the remote and satellite replication, along with the number of remote tlogs, log routers, and satellite tlogs 2018-02-26 13:15:44 -08:00
Evan Tschannen e3c6b66240 fix: do not commit more data after being stopped
fix: prioritize dc locality above exclusion to prevent being stuck after excluding all machines in a data center
2018-02-26 13:13:37 -08:00
Evan Tschannen 37a6a81634 Merge commit '7f6fc3e039c911cd84b8540f7f799fc38a1c1822' into feature-remote-logs
# Conflicts:
#	fdbserver/workloads/RestartRecovery.actor.cpp
2018-02-23 12:33:28 -08:00
Evan Tschannen cfcf98cffc fix: log router tags were not stored at a best location 2018-02-23 12:26:19 -08:00
Evan Tschannen a49e43000e fix: did not peek from log routers correctly 2018-02-22 16:13:56 -08:00
Evan Tschannen 719bb5bd0c
Merge pull request #4 from bnamasivayam/getKeyServers-refactor
Having 1000 as the limit for Limit for GetKeyServerLocationsRequest s…
2018-02-22 12:39:48 -08:00
Balachandar Namasivayam 2fe2b522d5 Simple refactor of code... 2018-02-22 12:38:14 -08:00
Alec Grieser e1162e9238 Merge remote-tracking branch 'upstream/release-5.1' 2018-02-22 11:16:12 -08:00
Balachandar Namasivayam e2030db5a8 Refactor getKeyServers to be more readable.
Fix possible memory corruption by returning KeyRange instead of KeyRangeRef in getKeyServers.
Simplify getMasterProxies on DatabaseContext class.
2018-02-21 17:11:50 -08:00
Evan Tschannen 2aa273df96 addStorageServer was advancing tags too much because of read errors 2018-02-21 17:05:39 -08:00
Evan Tschannen 310f56d98a fix: tlogs was resized incorrectly 2018-02-21 15:28:02 -08:00
Evan Tschannen ddb484143c fix: do not peek from remote logs if they are not fully recovered 2018-02-21 14:06:44 -08:00
Alec Grieser 0bae9880f1 remove trailing whitespace from our copyright headers ; fixed formatting of python setup.py 2018-02-21 10:25:11 -08:00
Balachandar Namasivayam 6218934c7b Having 1000 as the limit for GetKeyServerLocationsRequest sometimes generates large packet warnings. Reduce it to 100.
Fix the bug where some of the key server shards may not be fetched.
2018-02-20 17:41:34 -08:00
Evan Tschannen 1dc6a8d4bd fix: the tlog can peek from log systems that have been recovered even if it does not match its recoverFrom set 2018-02-20 14:50:13 -08:00
Alec Grieser aadc06de99 Merge remote-tracking branch 'upstream/release-5.1' 2018-02-20 14:28:29 -08:00
Evan Tschannen 9ea963ddd6 fix: the master did not detect core state changes if it changed while writing
fix: do not attempt to use three_data_hall when in a fearless deployment
fix: log router tags are ephemeral and can be cleared after every recovery
2018-02-19 16:49:57 -08:00
Evan Tschannen 1b5628d2c5 testing a single configured fearless setup in simulated cluster
consolidated simulation connection disablers into one call in the tester
automatically reconfigure from a fearless setup in simulation
2018-02-18 12:59:43 -08:00
Evan Tschannen 31b89a638f added satellite_none and remote_none options to unconfigure from a fearless setup
fix: log_router configuration was broken
2018-02-17 13:51:17 -08:00
Stephen Atherton 54fc81b260 Improved backup error reporting in backup status. The most recent error for each error type is reported along with how long ago the error occurred, and errors are divided into two categories based on whether or not they occurred since the most recent backup progress. 2018-02-16 19:38:31 -08:00
Evan Tschannen dc93759e15 suppressed trace events that are spammy 2018-02-16 16:01:19 -08:00
Evan Tschannen cb25564d38 simulated cluster supports fearless configurations
removed unused simulation variables
run the simulation with only 1 coordinator most of the time, since we protect the coordinator from being killed, and protecting too many things is bad for simulation
2018-02-15 18:32:39 -08:00
Evan Tschannen ad19d3926b fix: make sure there are enough machines in each dc to support triple replication for the configure workload 2018-02-14 17:06:22 -08:00
Evan Tschannen 5303962af6 re-enabled configure database and remove servers safely, even though they do not work with fearless 2018-02-14 16:07:23 -08:00
Evan Tschannen ead3892e77 fix: prevent fast spin for future version 2018-02-14 15:16:18 -08:00
Evan Tschannen 110309272c fix: do not count a server as read-write unless it has a recent version, because it could have been readable a long time ago 2018-02-14 15:09:19 -08:00
A.J. Beamon 3300c2efed Enable slow task profiling in the consistency check processes. 2018-02-14 09:50:12 -08:00
Evan Tschannen d2b0c07558 storage servers continue to attempt to pop old tags after the log system updates 2018-02-13 18:34:13 -08:00
Evan Tschannen 1fedcba890 fix: do not use log router tags when configured without remote logs
fix: data distribution tracks undesired storage servers
re-enabled consistency check
2018-02-13 17:01:34 -08:00
Evan Tschannen a52ea4eb78 restored 5.1 functionality of simulated cluster. Will test assigned primary and remote data centers. Does not test remote replication or satellite logs 2018-02-10 13:27:51 -08:00
Evan Tschannen 42405c78a5 Merge commit '4038bd2fd968d88861f2cebd442ce511724816cb' into feature-remote-logs
# Conflicts:
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/Knobs.cpp
2018-02-10 12:08:52 -08:00
Evan Tschannen fbadcc6eea changing a storage server’s tag must be the first mutations applied in a version, because privatized mutations applied earlier in the same version will use the old tag 2018-02-09 18:21:29 -08:00
Evan Tschannen c7b3be5b19 re-enabled better master exists
the cluster controller can choose a better data center for itself and let the workers know where the next cluster controller should be recruited
2018-02-09 16:48:55 -08:00
Stephen Atherton acb876d520 Merge branch 'release-5.1' 2018-02-07 15:11:52 -08:00
Evan Tschannen d0caffd339 fix: knob was set to incorrect value 2018-02-06 18:11:45 -08:00
Stephen Atherton 3a49211c44 Merge branch 'release-5.1' 2018-02-06 13:58:35 -08:00
Stephen Atherton 7de40413d5 Merge branch 'release-5.1' of github.com:apple/foundationdb into release-5.1 2018-02-06 13:44:25 -08:00
Stephen Atherton 0792d5e3dd Fix: last restorable version for a backup tag name (a separate value from the latest restorable version for a configured backup) was not being updated.
Fix: backup blob speed was sometimes an error because the JSON $sum merge operator did not support mixed numeric types.
Fix: JSON merge operator handling was squashing errors in some cases, which was generally obscuring the backup speed metric issue.
Cleaned up some of the JSON object merging logic.
Improved error messages in JSON merge operators.  Added JSON merge operator tests for mixed numeric math and improved readability of test output.
2018-02-06 13:44:04 -08:00
Evan Tschannen b7dde88029 fix: the cluster controller did not consider the master sharing the same process as the cluster controller as bad in all needed locations
waited too long for good recruitment locations, which would add too much time to recoveries of clusters that do not use machine classes
2018-02-06 11:30:05 -08:00
Evan Tschannen 63a9f2aed6 fix: history tags were being incorrectly popped
fix: history tags were not cleared when a storage server was removed
2018-02-03 12:20:18 -08:00
Evan Tschannen ebd94bb654 removed a separately configurable storage team size for the remote data center, because it did not make sense
fix: the master did not monitor for the failure of remote logs
stop merge attempts when a data center is failed
fixed a variety of other problems with data distribution when a data center is failed
2018-02-02 11:46:04 -08:00
Evan Tschannen 766964ff48 fix: dest tags were not repopulated when the tag cache was cleared 2018-01-31 17:35:48 -08:00
A.J. Beamon 0c601d6f85 Purge past version references 2018-01-31 12:05:41 -08:00
Evan Tschannen 6b54d56ca7 gracefully exit if attempting to upgrade from 4.X versions 2018-01-30 17:10:50 -08:00
Evan Tschannen b48d8ce96d getTeam will return an unhealthy exact match if all teams are unhealthy. Resubmit relocation requests once healthy teams are available 2018-01-30 17:00:51 -08:00
Evan Tschannen 4160765fa1 added a buggify which reboots a server immediately after it has changed its locality 2018-01-29 18:21:28 -08:00
Evan Tschannen af97a512f5 to support more complicated policies in the future for determining the best location for a tag within a set of tlogs, use an integer instead of a bool 2018-01-29 17:48:18 -08:00
Evan Tschannen 497bc3fe83 fix: txsTag needs to choose the same best location as 5.X version of the software 2018-01-29 17:09:35 -08:00
Evan Tschannen 29c5d4ad3d upgrades from 5.X mostly supported, still some remaining correctness problems 2018-01-28 11:52:54 -08:00
Evan Tschannen 79d94214a4 Merge commit 'f4ffc9752b5ec66ac47f5f684a5d8be06a7eae6e' into feature-remote-logs 2018-01-25 10:12:06 -08:00
A.J. Beamon 2744646090 Merge branch 'release-5.0' into release-5.1 2018-01-22 11:57:58 -08:00
A.J. Beamon 188562ccbc fix: Status should create its DatabaseConfiguration using fromKeyValues(). This makes sure that various state is correctly set if not specified in the configuration. 2018-01-22 11:40:08 -08:00
Evan Tschannen 66b2218989 added tlog support for upgrading from 5.X clusters. Does not support upgrading from 4.X or earlier. Untested, storage servers still need the ability to change their tag. 2018-01-21 12:21:46 -08:00
Evan Tschannen 698ef4117e Merge branch 'master' into feature-remote-logs 2018-01-20 10:34:30 -08:00
Evan Tschannen b5eba4f13a fix: do not check for desired data centers if they have not been set 2018-01-20 10:28:59 -08:00
A.J. Beamon 35b91bfb55 Add back (in different form) some ratekeeper trace events when a storage server or log doesn't respond. Add actualTPS (named TPSBasis) to RkUpdate. 2018-01-18 14:51:38 -08:00
Evan Tschannen b78e0a362a fix: do not pause when running multiple backup tests simultaneously 2018-01-18 12:24:33 -08:00
Evan Tschannen 2e46ee3dba fix: getTeam works when there are no teams 2018-01-17 17:49:13 -08:00
Evan Tschannen 264dc44dfa fixed many more bugs associated with running without remote logs 2018-01-17 17:03:17 -08:00
Stephen Atherton 93b34a945f Major usability and performance improvements to backup management. Backup descriptions now calculate and display timestamps using TimeKeeper data (if given a cluster) and restorability of snapshots. Expire now requires a --force option to leave a backup unrestorable or unrestorable after a given point in time, specified by version or timestamp. BackupContainerFilesystem now maintains metadata on key version boundaries in order to avoid large list operations for describe and expire operations. Blob parallel recursive list operations can now take a path (aka prefix) filter function. New describe and expire options are available in fdbbackup. 2018-01-17 04:09:43 -08:00
Evan Tschannen 8f58bdd1cd fixed a large number of problems related to running without remote logs 2018-01-16 18:12:40 -08:00
Evan Tschannen 316e200a0c fix: compilation errors after merge 2018-01-16 10:48:50 -08:00
Evan Tschannen 21482a45e1 Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbserver/DBCoreState.h
#	fdbserver/LogSystem.h
#	fdbserver/LogSystemPeekCursor.actor.cpp
#	fdbserver/TLogServer.actor.cpp
2018-01-14 13:40:24 -08:00
Evan Tschannen 645dc5ead6 warmRange needs to get a read version occasionally to prevent it from overwhelming the proxy
quietDatabase waits for all data distribution to be completely finished so that databases are cached in a cleaner state
2018-01-14 12:50:52 -08:00
Evan Tschannen be643d6937 fix: the tlog did not cancel recovery properly when stopped 2018-01-12 17:18:14 -08:00
Evan Tschannen 3915d6825c we need to check the server list at a higher priority, because if we do not notice a storage server interface change for a long period of time, we will mark it as failed 2018-01-12 12:51:07 -08:00
Evan Tschannen de119f192d fixed a priority inversion where the tlog would prefer to copy data from the previous generation rather than make data durable (leading to being ratekeeper controlled) 2018-01-11 16:09:49 -08:00
Evan Tschannen 29ebb19388 Merge branch 'release-5.0' into release-5.1 2018-01-11 15:43:37 -08:00
Evan Tschannen 22e5a0b257 formatting 2018-01-11 14:44:09 -08:00
Evan Tschannen 173a8de3ed DBCoreState supports upgrades from 3.0 versions 2018-01-11 14:39:51 -08:00
A.J. Beamon 2f5073d00f Some visual studio project cleanup. 2018-01-10 10:07:18 -08:00
Evan Tschannen 022df3b91b backup and restore sometimes took too long in simulation 2018-01-09 17:26:42 -08:00
Evan Tschannen 645f68212b make timekeeper priority system immediate 2018-01-08 18:21:00 -08:00
Evan Tschannen 370e8a9903 fix: split metrics could fail an assert in a very rare scenario 2018-01-08 18:20:22 -08:00
Evan Tschannen 9630deba3a fixed a number of bugs related to running fearless without remote logs 2018-01-08 12:04:19 -08:00
Evan Tschannen d3116fb336 masterRecoveryDuration is only a sevWarnAlways outside of simulation 2018-01-07 15:37:45 -08:00
Evan Tschannen 4e8bc273b3 added a version of getKeyRangeLocations that checks for endpoint failures
fix: did not add the cluster controller to id_used in all cases
removed obsolete fixmes
2018-01-07 15:32:43 -08:00
Evan Tschannen 30710f7493 syncLogId was not necessary 2018-01-06 14:52:39 -08:00
Evan Tschannen 3ec45d38a0 Merge branch 'master' into feature-remote-logs
# Conflicts:
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2018-01-06 13:54:45 -08:00
Evan Tschannen 10c3fc165e fix: after recovering from disk, only allow peeking data that was fully recovered 2018-01-06 13:49:13 -08:00
Stephen Atherton b86f68ceb8 Added new test that combines atomic backup/restore. Added randomization to delays in AtomicRestore workload. 2018-01-05 14:43:21 -08:00
Evan Tschannen 63751fb0e2 fix: remote logs are not in the log system until the recovery is complete so they cannot be used to determine if this is the correct log system to recover from 2018-01-05 14:15:25 -08:00
Evan Tschannen 5ac4f73978 Merge branch 'release-5.1' into feature-remote-logs
# Conflicts:
#	fdbclient/NativeAPI.actor.cpp
#	fdbrpc/Locality.h
#	fdbrpc/simulator.h
#	fdbserver/ApplyMetadataMutation.h
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/LogSystemPeekCursor.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/SimulatedCluster.actor.cpp
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/TagPartitionedLogSystem.actor.cpp
#	fdbserver/WorkerInterface.h
#	fdbserver/masterserver.actor.cpp
#	flow/Net2.actor.cpp
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2018-01-05 11:33:42 -08:00
A.J. Beamon 5015119115 Generalize the message that gets displayed in status if a cluster file's contents are incorrect. 2018-01-05 10:29:47 -08:00
Evan Tschannen e11f461cbd fix: better master exists needs to check master fitness before tlogs or proxies because that is the order of recruitment 2018-01-04 15:19:46 -08:00
Evan Tschannen f8f1c48d83 sometimes test pausing backups 2018-01-04 11:40:08 -08:00
Evan Tschannen f2c4beed9f fix: tlogFitness did not consider it better to have one tlog of a better fitness
fix: checkStable was not used in all places in better master exists
fix: we need to call checkOutstanding on worker registration in all cases
fix: in case persistentData is keyValueStoreMemory, we need to make sure it is fully recovered before writing to it
2018-01-04 11:33:02 -08:00
Evan Tschannen 6d5dd9bd27 fix: we cannot pipeline disk queue commits until after the first commit is successful 2018-01-02 13:30:27 -08:00
Evan Tschannen 86958cb08d Merge pull request #226 from cie/fix-taskBucket-unblockFuture
Modify TaskBucketCorrectness to support chain and multiple tasks
2017-12-20 18:00:54 -08:00
Yichi Chiang 91e5abeaa6 Modify TaskBucketCorrectness to support chain and multiple tasks 2017-12-20 17:02:49 -08:00
Alex Miller f70e3b9fe8 Add or change a bunch of comments to provide descriptions of function contracts.
This cleans up a bit of the VersionStamp DR work I did, and leaves hints and
advice for anyone who will be touching mutation applying code in the future.
2017-12-20 16:57:14 -08:00
Evan Tschannen 982f0dcb1e Merge pull request #222 from cie/alexmiller/drtimefix2
Fix yet another VersionStamp DR issue.
2017-12-20 15:09:23 -08:00
Alex Miller b5a6bc0ab7 Fix VersionStamp problems by instead adding a COMMIT_ON_FIRST_PROXY transaction option.
Simulation identified the fact that we can violate the
VersionStamps-are-always-increasing promise via the following series of events:

1. On proxy 0, dumpData adds commit requests to proxy 0's commit promise stream
2. To any proxy, a client submits the first transaction of abortBackup, which stops further dumpData calls on proxy 0.
3. To any proxy that is not proxy 0, submit a transaction that checks if it needs to upgrade the destination version.
4. The transaction from (3) is committed
5. Transactions from (1) are committed

This is possible because the dumpData transactions have no read conflict
ranges, and thus it's impossible to make them abort due to "conflicting"
transactions.  There's also no promise that if client C sends a commit to proxy
A, and later a client D sends a commit to proxy B, that B must log its commit
after A.  (We only promise that if C is told it was committed before D is told
it was committed, then A committed before B.)

There was a failed attempt to fix this problem.  We tried to add read conflict
ranges to dumpData transactions so that they could be aborted by "conflicting"
transactions.  However, this failed because this now means that dumpData
transactions require conflict resolution, and the stale read version that they
use can cause them to be aborted with a transaction_too_old error.
(Transactions that don't have read conflict ranges will never return
transaction_too_old, because with no reads, the read snapshot version is
effectively meaningless.)  This was never previously possible, so the existing
code doesn't retry commits, and to make things more complicated, the dumpData
commits must be applied in order.  This would require either adding
dependencies to transactions (if A is going to commit then B must also be/have
committed), which would be complicated, or submitting transactions with a fixed
read version, and replaying the failed commits with a higher read version once
we get a transaction_too_old error, which would unacceptably slow down the
maximum throughput of dumpData.

Thus, we've instead elected to add a special transaction option that bypasses
proxy load balancing for commits, and always commits against proxy 0.  We can
know for certain that after the transaction from (2) is committed, all of the
dumpData transactions that will be committed have been added to the commit
promise stream on proxy 0.  Thus, if we enqueue another transaction against
proxy 0, we can know that it will be placed into the promise stream after all
of the dumpData transactions, thus providing the semantics that we require:  no
dumpData transaction can commit after the destination version upgrade
transaction.
2017-12-20 15:04:04 -08:00
Stephen Atherton e0d9cea008 Merge branch 'master' into continuous-backup
# Conflicts:
#	fdbclient/FileBackupAgent.actor.cpp
#	fdbrpc/BlobStore.actor.cpp
2017-12-19 23:02:14 -08:00
Alex Miller c7dbd31a1e Refactoring: Create a common prefixRange and do UID->Key once in backup. 2017-12-19 17:17:50 -08:00
Alex Miller 1488c12c18 Simulation will return an error and print if any non-suppressed SevError events were logged.
This means that loops like `seed=1; while ./fdbserver -r simulation -s $seed;
do seed=$(($seed+1)); done` can be used to find an example of an often failing test.  This
also means joshua will report ExitCode errors on anything that has a SevError
in the log.

As a part of this, we also implicitly downgrade any injected errors to SevWarnAlways.
2017-12-19 17:17:50 -08:00
Stephen Atherton e28641886d TraceEvent improvements. Minor bug fix, restore log writing tasks didn't have the log file endVersion but it's only for logging purposes. 2017-12-19 15:27:04 -08:00
Evan Tschannen a5601877b3 fix: valgrind issue with destruction ordering 2017-12-18 15:31:59 -08:00
Evan Tschannen 1dc9eceb6d optimize GetKeyLocationRequests on the proxy so they only require a single map lookup, instead of doing 3 + (3* [number of ranges]) lookups 2017-12-15 20:13:44 -08:00
Stephen Atherton 33f9f1a95c Added SnapshotDispatch task for writing snapshots in random order over a specified period of time and adapting speed to a growing or shrinking database. TaskBucket now supports scheduling tasks. TaskFuture now correctly recognizes multiple tasks in its callback space. TaskBucket extendTimeout() now supports specifying the new timeout version. Submitting a backup now requires a snapshot duration. 2017-12-14 01:44:38 -08:00
Evan Tschannen 7ce93426ed fix: connection disabler in removeServerSafely needs to run for the whole test to avoid getting stuck on include all 2017-12-12 18:38:57 -08:00
Alec Grieser 4495a19299 Merge pull request #220 from cie/alexmiller/flowprofcircus
Add class restrictions to CpuProfiler, and fix metric crash.
2017-12-11 14:13:22 -08:00
Evan Tschannen 73a0a07eac clients ask for key location information directly from the proxy, instead of reading it from the database 2017-12-09 16:10:22 -08:00
Alex Miller 48660e9ce5 Add class restrictions to CpuProfiler, and fix metric crash.
This change largely refactors away the old meaning of the value given to
flow_profiler, which was the number of machines that we'd be profiling, and
instead replaces it with the classes of processes to profile for the duration
of the test.  Most importantly, this means that one can profile in circus with
a configuration that has "ssd" in it, and the circus run will still complete
(as long as the argument isn't "storage").

And also finally add some other fixes I had to the same file to conditionally
change the name of the metric we're looking for to comply with what's actually
written.
2017-12-07 19:28:29 -08:00
Stephen Atherton abb2dd1ebc Merge pull request #214 from cie/alexmiller/fallocate
Use fallocate to zero ranges instead of writing zeroes
2017-12-06 13:47:40 -08:00
Evan Tschannen 5a947212ed fix: ensure all prior commits have completed before returning that a commit has committed from the disk queue 2017-12-06 12:31:07 -08:00
Stephen Atherton f8e89a40ac Bug fixes, take(1) is incorrect usage of FlowLock. 2017-12-04 10:25:47 -08:00
Evan Tschannen 49dac11a5f added a SevWarnAlways for when a disk queue file grows larger than 20GB 2017-12-01 15:05:17 -08:00
Evan Tschannen 482ac38ca6 added knobs so that the client failure monitoring update rate and the server failure monitoring update rate are separate knobs 2017-12-01 13:04:32 -08:00
Evan Tschannen c3918d892a do not use bandwidth splitting on the keyServer shard, lots of sets and clears to this shard generally means you do not want to create additional data distribution work 2017-11-30 18:28:16 -08:00
Alex Miller 196258080b Refactor zeroing a chunk of a file from DiskQueue into IAsyncFile.
If we're going to do the work to provide more optimized ways to zero files,
then I'd feel better with this being in a more common place, so that any other
zero-ers are likely to reuse it.  It also makes testing easier/more obvious.

Also, because it's needed for correctness, fix the aligned_alloc for OSX, which
wasn't aligned, and use an actually aligned allocation function.
2017-11-30 17:57:55 -08:00
Alex Miller c7a120c59d Rename IAsyncFile::incrementalDelete -> IAsyncFileSystem::incrementalDeleteFile.
`deleteFile` existed in IAsyncFileSystem, so an incremental delete function
seems to belong more as a virtual method on IAsyncFileSystem than a static
method on IAsyncFile, and the naming should match.

As long as we're here, change IAsyncFile to declare a virtual destructor, so
that it has good and proper C++ behavior.  I presume this is what was vaguely
intended by the default constructor definition that previously existed?
2017-11-30 17:19:10 -08:00
Evan Tschannen 7f72aa7de5 fix: a storage server does not ever need to rollback before a version restored from disk 2017-11-30 11:19:43 -08:00
Evan Tschannen e5a682948c Merge pull request #212 from cie/check-cluster-controller-desired-class
Check cluster controller using desired process class in consistency c…
2017-11-29 15:57:51 -08:00
Yichi Chiang 8ba0eaebff Check cluster controller using desired process class in consistency check 2017-11-29 15:09:23 -08:00
Evan Tschannen 8c51bc4ac4 fixed low latency tests in a way that gives us better test coverage 2017-11-28 18:20:29 -08:00
Evan Tschannen dc624a54dc fix: avoid flushing large queues in simulation when checking latency 2017-11-27 17:23:20 -08:00
Stephen Atherton 1b1c8e985a Merge branch 'master' into backup-container-refactor
# Conflicts:
#	fdbclient/FileBackupAgent.actor.cpp
2017-11-25 19:54:51 -08:00
Stephen Atherton 6695c9e6a2 Bug fixes and improvements to error handling and trace events. The most serious bug was that restore would start at the wrong version, possibly skipping early log and range files. 2017-11-25 00:46:16 -08:00
Alex Miller f19cb3bbbd Merge pull request #208 from cie/alexmiller/grvtfix
Fix the GRV performance regression
2017-11-17 15:00:44 -08:00
Yichi Chiang d9a98aa968 Remove commented code 2017-11-16 17:25:37 -08:00
Yichi Chiang 0d5dc15ac8 Fix double recoveries 2017-11-16 16:58:55 -08:00
Alex Miller e9412bbb11 Fix the GRV performance regression introduced by adding the policy engine to GRV calculations.
Construction of LocalityGroup from LocalityData is expensive, and the previous
code greatly ran afoul of that.  The policy engine does a large amount of
interning of strings and building compressed maps to make the expected many
future selectReplica calls cheap.  Unfortunately we don't call selectReplicas,
so much of this work is undesirable for us, and a large amount of CPU time is
spent doing this initialization work.

The new changes aggressively do the minimal LocalityGroup::add() calls
necessary, and make them as cheap as possible by removing all elements from
LocalityData that don't need to be considered by the policy.

This optimization was also applied to the PeekCursor used during recovery,
which should speed recoveries up by a small amount.
2017-11-16 16:15:52 -08:00
Evan Tschannen ad456a939a Merge pull request #206 from cie/change-excluded-cluster-controller
Change excluded cluster controller
2017-11-15 17:28:33 -08:00
Yichi Chiang f96faf72d9 Add fullyRecoveredConfig for checking exclusions 2017-11-15 17:15:24 -08:00
Evan Tschannen 30464e943c Merge pull request #205 from cie/cleanup-spammy-traceevents
Cleanup spammy traceevents
2017-11-15 12:41:37 -08:00
Evan Tschannen e113dba0e3 added a new trace event tracking master recovery durations 2017-11-15 12:38:26 -08:00
Stephen Atherton a77162b53d Merge branch 'master' into backup-container-refactor
# Conflicts:
#	fdbclient/BackupAgent.h
#	fdbclient/FileBackupAgent.actor.cpp
#	fdbclient/KeyBackedTypes.h
2017-11-15 08:14:47 -08:00
Stephen Atherton 3dfaf13b67 IBackupContainer has been rewritten to be a logical interface for storing, reading, deleting, expiring, and querying backup data. The details of how the data is organized or stored are now hidden from users of the interface. Both the local and blobstore containers have been rewritten, the key changes being a multi level directory structure and no more use of temporary files or pseudo-symlinks in the blob store implementation. This refactor has a large impact radius as the previous backup container was just a thin wrapper that presented a single level list of files and offered no methods for managing or interpreting the file structure so all of that logic was spread around other places in the code base. This made moving to the new blob store schema very messy, and without this refactor further changes in the future would only be worse.
Several backup tasks have been cleaned up / simplified because they no longer need to manage the ‘raw’ structure of the backup.  The addition of IBackupFile and its finish() method simplified the log and range writer tasks.  Updated BlobStoreEndpoint to support now-required bucket creation and bucket listing prefix/delimiter options for finding common prefixes.  Added KeyBackedSet<T> type.  Moved JSONDoc to its own header.  Added platform::findFilesRecursively().

Still to do:  update command line tool to use new IBackupContainer interface, fix bugs in Restore startup.
2017-11-14 23:33:17 -08:00
Yichi Chiang df922bc973 Change excluded cluster controller 2017-11-14 13:57:37 -08:00
A.J. Beamon bb1297c686 Remove RkServerQueueInfo and RkTLogQueueInfo trace events, since this information is more or less already logged on the storage servers and tlogs. Update the quiet database check and magnesium to use the information from the logs and storage servers. 2017-11-14 12:59:42 -08:00
A.J. Beamon 3b952efb4e Remove events from cluster controller that get logged for roughly every worker upon recovery, master registration, etc. 2017-11-14 10:15:45 -08:00
A.J. Beamon 0fea5e9c2f Convert client_invalid_operation errors to ASSERTs. 2017-11-13 11:38:34 -08:00
A.J. Beamon cd085764f1 Do not automatically change a cluster file that does not match what you expect. 2017-11-10 14:12:45 -08:00
Alex Miller 311d1ca87d A variety of fixes that collectively fix using flow profiling in circus.
To run, use --co=flow_profiling=-1, because reasons.
2017-11-07 13:55:16 -08:00
Evan Tschannen 706bf1e018 fix: we cannot trigger better master exists before a master is fully recovered because exclusions changed by the provisional master will not be committed until the master is fully recovered 2017-11-04 12:48:04 -07:00
Evan Tschannen 57aba0b3bc fix: excluded servers were the same fitness as storage servers for the master role
fix: better master exists did not consider exclusions for master fitness
2017-11-03 17:09:14 -07:00
Yichi Chiang 42fad5efe5 Introduce cluster controller process class in circus 2017-11-03 14:22:55 -07:00
Yichi Chiang dcc9aafab7 Merge branch 'master' of github.com:apple/foundationdb 2017-11-02 10:47:59 -07:00
Yichi Chiang c033d8efd8 Fix typo message and remove extra TraceEvent which overwrites the expected one 2017-11-02 10:47:51 -07:00
Balachandar Namasivayam 3efaaec479 onMasterProxiesChanged was being triggered when any member of ClientDBInfo changed. Change the behavior to be triggered only when proxies field in ClientDBInfo is changed. 2017-11-01 18:29:56 -07:00
A.J. Beamon 7cf17df821 Merge branch 'master' into log-group-for-unsupported-clients
# Conflicts:
#	flow/Net2.actor.cpp
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2017-11-01 11:31:02 -07:00
A.J. Beamon 31caac67dc Rename supported_versions[x].clients to supported_versions[x].connected_clients 2017-11-01 10:41:30 -07:00
Balachandar Namasivayam 988bc0207f Reset Client Transaction profiling parameters when the config keys are cleared. 2017-10-31 15:40:57 -07:00
Alec Grieser 5a4a5985fd Merge branch 'release-5.0' 2017-10-30 08:31:23 -07:00
Alec Grieser 87321f5017 Merge branch 'release-4.6' into release-5.0 2017-10-30 08:31:01 -07:00
Evan Tschannen 54d82c0d92 Merge pull request #194 from cie/alexmiller/valgrind
Fix valgrind errors
2017-10-27 17:25:12 -07:00
Alex Miller e0d33ef8d7 Preemptively fix profiler-related valgrind errors/straight out bugs.
I forgot to initialize some fields in requests.
2017-10-27 17:20:19 -07:00
Evan Tschannen aa0c2ae317 only increase the max shard size if the shard begins in the keyServer keyspace, do not increase the minimum shard size 2017-10-27 14:22:26 -07:00
Evan Tschannen 3a4078bdda the keyservers shards are always a fixed large size 2017-10-27 11:52:11 -07:00
Balachandar Namasivayam cfefab18fb Merge branch 'master' into add-new-atomic-ops 2017-10-25 18:03:34 -07:00
Balachandar Namasivayam 3d5658940a Addressed Review Comments 2017-10-25 16:42:05 -07:00
Balachandar Namasivayam 9dd588dcce Addressed review comments.
Changed naming for NewMin and NewAnd to MinV2 and AndV2
2017-10-25 14:48:05 -07:00
Evan Tschannen d852a53ae4 Merge pull request #181 from cie/throttle-spammy-logs
Throttle spammy logs
2017-10-25 13:45:55 -07:00
Balachandar Namasivayam 2f6d55a52f Add correctness tests for all atomic ops 2017-10-25 13:36:49 -07:00
Yichi Chiang 4d54a73f5b Merge pull request #191 from cie/count-cluster-controller-role
Take cluster controller role into consideration when recruiting workers
2017-10-25 12:09:15 -07:00
Yichi Chiang f39cce9b8d Use processId instead of address for comparison 2017-10-25 11:35:29 -07:00
Yichi Chiang 5fcef911f0 Take cluster controller role into consideration when recruiting workers 2017-10-25 10:35:46 -07:00
Evan Tschannen 48901a9223 added a list of tlog IDs that are missing to status 2017-10-24 16:28:50 -07:00
Yichi Chiang c2a117fe07 Merge pull request #189 from cie/enable-check-desired-class
Enable checkUsingDesiredClasses() in consistency check
2017-10-24 15:18:21 -07:00
Yichi Chiang defdc6550d Exclude excluded processes when getting testers 2017-10-24 15:16:34 -07:00
Evan Tschannen df74e2a373 re-added support for non-copying tlog recovery 2017-10-24 15:09:31 -07:00
Yichi Chiang 3865c5ae0e Enable checkUsingDesiredClasses() in consistency check 2017-10-24 12:58:54 -07:00
Balachandar Namasivayam 8c3bdc5b3b Make atomic ops differentiate between unset and empty values. 2017-10-23 16:48:13 -07:00
Evan Tschannen 7a36fd2134 disabled a variety of simulation tests to get correctness clean 2017-10-19 15:49:54 -07:00
Evan Tschannen e2c1e87df6 made a large number of fixes to make fearless DR correctness clean. 2017-10-19 15:36:32 -07:00
Bhaskar Muppana 360b777b78 Fail with correct error code in case of abort or discontinue of
non-existing backups.
2017-10-18 23:17:48 -07:00
Alec Grieser dd6d8f3b0e Merge branch 'master' into add-new-atomic-ops 2017-10-18 16:36:44 -07:00
Bhaskar Muppana 2007f3799f Don't ignore TimeKeeper failures. 2017-10-18 14:31:31 -07:00
Bhaskar Muppana 314511f4d7 Fixing spaces in BackupCorrectness TraceEvents. 2017-10-18 14:27:52 -07:00
Alex Miller 7b9bc1d715 Merge pull request #170 from cie/alexmiller/flowprofile
Add support for profiling a running fdb cluster to fdbcli, fix security issues, and add an improved backtrace.
2017-10-16 16:51:53 -07:00
Alex Miller f997cb9038 Add a string knob to hold the Log directory, and write profiles to it.
This is the combination of two small changes.

1. Add support for a string knob type.
2. Change profiles to be written to the log directory instead of the working
   directory.

We have three options of where to write files: the working directory, the data
directory, and the log directory.

The working directory may be set to a non-writable location, and likely
contains the fdb binaries.  Allowing these files to be overwritten would likely
not be a wise idea.

The data directory hosts our sqlite b-trees.  It would also be very unfortunate
if these were ever overwritten by an unfortunate profile name.

The log directory contains logs.  Out of the three, these matter the least if
they disappear or become corrupted.

Thus, we write to the log directory.
2017-10-16 16:05:02 -07:00
Alex Miller c5fbe33df6 Disallow arbitrary paths for storing profiles.
Previously, one could request profiles to be stored at
"../../../../../../etc/passwd".  Now we expand the paths, including symlinks,
and ensure that the target is a child of the targeted subdirectory.  This was
the least convoluted way I could figure out to handle paths.
2017-10-16 16:05:02 -07:00
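The containment check described in c5fbe33df6 can be illustrated with a minimal Python sketch. This is not the fdbserver code: the function name `resolve_inside` and its error handling are hypothetical, and it only assumes the approach the message states (expand the path, including symlinks, then require the result to be a child of the allowed directory).

```python
import os


def resolve_inside(base_dir: str, requested: str) -> str:
    """Return the canonical path for `requested` under `base_dir`.

    Hypothetical helper: raises ValueError when the expanded path
    (symlinks and ".." resolved) is not a child of `base_dir`.
    """
    # Canonicalize the allowed directory first so both sides compare alike.
    base = os.path.realpath(base_dir)
    # realpath resolves "..", ".", and symlinks in the combined path.
    target = os.path.realpath(os.path.join(base, requested))
    # The target must sit strictly below the base directory.
    if target == base or os.path.commonpath([base, target]) != base:
        raise ValueError(f"{requested!r} escapes {base_dir!r}")
    return target
```

Comparing canonical paths, rather than scanning the request for "..", also rejects absolute paths and symlinks that point outside the directory.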
Alex Miller 91a26a170c Add toggleable profiling support to fdbserver+fdbcli.
This adds the fdbcli commands:
* profile list -- Lists all workers in a way that doesn't fill `kill`'s list.
* profile flow run -- Allows starting flow profiling on a set of hosts for a specified interval.

And threads through all the support for enabling and disabling profiling as an RPC.
2017-10-16 16:05:02 -07:00
Balachandar Namasivayam 312f614133 Add the new ops and AND to NON_ASSOCIATIVE_MASK.
In the storage server, read the entire value if the op is ByteMin or ByteMax.
2017-10-16 11:06:31 -07:00
Alec Grieser e0be1ef1e0 Merge branch 'release-5.0' 2017-10-16 10:08:11 -07:00
Alec Grieser 432726ba2d Merge branch 'release-4.6' into release-5.0 2017-10-16 09:54:21 -07:00
Stephen Atherton 68eccb681e Merge pull request #173 from bmuppana/master
Backup log messages.
2017-10-13 18:31:53 -07:00
Evan Tschannen 215bcb8d3e Merge pull request #157 from cie/choose-leader-on-stateless-processes
Catch and update processClass change from DBSource
2017-10-13 14:03:29 -07:00
Yichi Chiang 5bcdd37c0d Move UID generation and add initialClass 2017-10-13 13:46:37 -07:00
Yichi Chiang 12edd27281 Introduce prevChangeID to CandidacyRequest and LeaderHeartbeatRequest 2017-10-12 17:11:58 -07:00
Bhaskar Muppana d1e9d28239 Backup log messages. 2017-10-12 16:12:42 -07:00
Stephen Atherton 11517f7bfc Merge branch 'master' into continuous-backup
# Conflicts:
#	fdbclient/FileBackupAgent.actor.cpp
2017-10-12 11:03:23 -07:00
Alex Miller c24b941485 Fix erroneous std::move in indexed set, and clean up addMetric users.
This is a follow-on to c4eb73d0.  Thanks to Bala for pointing out the unchanged
std::move usage, and there appeared to not be many existing users of addMetric
anyway.
2017-10-11 17:36:51 -07:00
Balachandar Namasivayam 8e0bea2795 Update API_VERSION from 500 to 510 2017-10-11 13:49:38 -07:00
Stephen Atherton c3d8412abb Merge pull request #166 from cie/alexmiller/deathservice
Fix potential division by zero issues via RPC.
2017-10-10 16:47:38 -07:00
Evan Tschannen ff1b49be2e Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbserver/DatabaseConfiguration.cpp
2017-10-10 16:07:59 -07:00
Evan Tschannen 8feb3b8fbc fixed conflict range workload by just disabling timeKeeper instead of the check, because it should be a more robust fix 2017-10-10 16:01:02 -07:00
Balachandar Namasivayam eeebf10030 Modified existing behavior of MIN and AND atomic ops. The new behavior results in a 'SET' if the atomic op is performed on a non-existing key.
Added new atomic ops ByteMin and ByteMax that do lexicographic comparison of byte strings.
2017-10-10 13:02:22 -07:00
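The semantics described in eeebf10030 can be sketched in Python as follows. The function names are hypothetical and `None` stands in for a non-existing key. Treating MIN operands as little-endian unsigned integers fitted to the parameter's width, and having ByteMin/ByteMax store the parameter on a missing key, are assumptions not stated in the commit message.

```python
from typing import Optional


def byte_min(existing: Optional[bytes], param: bytes) -> bytes:
    """Lexicographic minimum; assumed to store `param` when the key is unset."""
    if existing is None:
        return param
    return min(existing, param)  # Python compares bytes lexicographically


def byte_max(existing: Optional[bytes], param: bytes) -> bytes:
    """Lexicographic maximum; assumed to store `param` when the key is unset."""
    if existing is None:
        return param
    return max(existing, param)


def min_v2(existing: Optional[bytes], param: bytes) -> bytes:
    """Integer minimum that acts as a SET on a non-existing key."""
    if existing is None:
        return param  # the new behavior: unset key results in a SET
    width = len(param)
    # Assumption: the existing value is truncated or zero-padded to param's width.
    fitted = existing[:width].ljust(width, b"\x00")
    a = int.from_bytes(fitted, "little")
    b = int.from_bytes(param, "little")
    return fitted if a <= b else param
```

Note that an empty value (`b""`) is distinct from an unset key (`None`): `byte_min(b"", b"b")` keeps the empty value, while `byte_min(None, b"b")` stores `b"b"`.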
Evan Tschannen c8525dc3e7 timekeeper is constantly changing keys in the system keyspace, so do not report errors on key mismatches on keys in the system keyspace 2017-10-10 12:04:56 -07:00
Evan Tschannen 3d2103075d data distribution tracks teams for each data center separately 2017-10-10 10:36:33 -07:00
Evan Tschannen 5e6eba365b fix: always set confChange, because popVersion is not deterministic across proxies, and confChange needs to be set deterministically 2017-10-06 18:37:08 -07:00
Evan Tschannen 93b3d0e4e7 fix: toMap didn’t report logs, proxies, and resolvers 2017-10-06 15:55:50 -07:00
Evan Tschannen 15962cf079 Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbrpc/Locality.cpp
#	fdbrpc/Locality.h
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/ClusterRecruitmentInterface.h
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/TagPartitionedLogSystem.actor.cpp
#	fdbserver/WorkerInterface.h
#	fdbserver/fdbserver.vcxproj.filters
#	fdbserver/masterserver.actor.cpp
#	fdbserver/worker.actor.cpp
#	flow/error_definitions.h
2017-10-05 17:09:44 -07:00
Alex Miller a21c8a820b Move cpuProfilerRequest from WorkerInterface to ClientWorkerInterface.
A way to access this stream is required if we wish to be able to toggle
profiling from fdbcli.  There are two ways to do this:

1. Use `monitorLeader()` to get a `ClusterControllerFullInterface`, and use
`getWorkers` from there to get a list of `WorkerInterface`s, from which we can
access cpuProfilerRequest.
2. Move cpuProfilerRequest to ClientWorkerInterface and use the existing code
in the client that can fetch a list of all `ClientWorkerInterface`s.

The split between WorkerInterface and ClientWorkerInterface appears to be
what a client might have a need to call versus what is fdbserver-internal (and
thus no client should even want to call). Thus, it seems to make more sense to
acknowledge that profiling is useful to be able to toggle from a client, and go
with option (2).
2017-10-05 14:08:28 -07:00
Yichi Chiang 3edc2824a9 Add initialClass to RegisterWorkerRequest 2 2017-10-05 11:03:25 -07:00
Yichi Chiang 05f7626e39 Add initialClass to RegisterWorkerRequest 2017-10-04 17:11:12 -07:00
Yichi Chiang 3c70df57b5 Fix cluster controller review comments 2017-10-04 15:48:55 -07:00
Alex Miller e55cc447d2 Address code review comments.
* Fixed memory corruption with SystemData key constants
* Removed duplication in ClusterController
* Reworked fdbcli actions to better represent explicit vs default assignments
2017-10-04 13:36:18 -07:00
A.J. Beamon 5063793f36 Revert line ending change 2017-10-04 11:19:19 -07:00
Alex Miller 706427ee62 Fix potential division by zero issues via RPC.
A carefully crafted SplitMetricRequest could have caused division by zero.
It's not really great to offer Division By Zero As A Service, so let's just
return an error instead.
2017-10-03 22:11:08 -07:00
Evan Tschannen 3a2ddcc84a Add destinations that are read-write to the source list, so that cancelled data movement can contribute to copying the data for the next movement. 2017-10-03 17:39:08 -07:00
Balachandar Namasivayam 0e153cdd35 Throttle Spammy logs. Three knobs are added.
Trace Events are sampled and cached with an expiration set. Every TraceEvent above SevDebug is checked against this cache to see if it exceeded a set threshold. If yes, then throttle the TraceEvent.
If a TraceEvent is throttled, a warning msg is logged.
2017-10-02 18:43:11 -07:00
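The throttling idea in 0e153cdd35 can be approximated with a simplified Python sketch. It swaps in a plain fixed-window counter per event type and omits the sampling step the message mentions. The class name `TraceThrottle`, its parameters, and the injectable clock are hypothetical; the three knobs the commit adds are not named in the message and are not modeled here.

```python
import time
from typing import Callable, Dict, Tuple


class TraceThrottle:
    """Allow at most `threshold` events of each type per expiring window."""

    def __init__(self, threshold: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = threshold
        self._window = window_seconds
        self._clock = clock
        # event type -> (window start time, events seen in this window)
        self._counts: Dict[str, Tuple[float, int]] = {}

    def should_log(self, event_type: str) -> bool:
        now = self._clock()
        start, count = self._counts.get(event_type, (now, 0))
        if now - start >= self._window:
            # The cached entry has expired; start a fresh window.
            start, count = now, 0
        count += 1
        self._counts[event_type] = (start, count)
        return count <= self._threshold
```

A caller would log the event when `should_log` returns True and emit a single warning about throttling otherwise.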
Evan Tschannen 6ea9903c82 Merge branch 'release-5.0'
# Conflicts:
#	fdbbackup/backup.actor.cpp
#	fdbserver/ClusterController.actor.cpp
#	versions.target
2017-10-01 18:46:44 -07:00
Evan Tschannen 0949c4be65 Revert "Fixed problem with master being recruited on excluded servers"
This reverts commit 1f7b624734a8ad6e896dd3f01f9cdf334ca62486.
2017-10-01 16:30:19 -07:00
Evan Tschannen 696d432462 Revert "fix: excluded servers are worst fit for master rather than never assign (so that we can recover if every process has been excluded)"
This reverts commit 83b2ce68c8e1a29fc1559598cc38d3ef7eb46101.
2017-10-01 16:29:32 -07:00
Evan Tschannen 0dde15f1d2 fix: excluded servers are worst fit for master rather than never assign (so that we can recover if every process has been excluded)
fix: better master exists did not use exclusions because the configuration was reset
2017-10-01 16:26:58 -07:00
Yichi Chiang 636ce4a131 Replace the leader when a better one is found 2017-09-29 16:34:55 -07:00
Alex Miller 11668bb359 Fixing code review comments. 2017-09-29 15:58:36 -07:00
Alex Miller b7ce9d996c Comment out verbose TraceEvents in preparation for pushing. 2017-09-29 15:58:36 -07:00
Alex Miller c40c1bb5fe Add a new workload: BackupToDBAbort, which does an ACI switchover.
This is to allow easier testing of non-durable switchovers without having to
wiggle into BackupToDBCorrectness's view of the world.
2017-09-29 15:58:36 -07:00
Alex Miller 9e9a96ae76 Make VersionStamp workload able to run with DR-style workloads.
* It is now tolerant of locked database errors, and handles them correctly.
* There is an option to specify which database to verify against.
2017-09-29 15:58:36 -07:00
Alex Miller 34630b6130 Make VersionStamp workload able to handle commit_unknown_result.
Previously, if a transaction failed with commit_unknown_result, and was
actually committed, it would look like data that magically appeared in the
database and verification would fail.

Now, we explicitly re-read and check to see if the commit happened, so that we
may maintain an accurate understanding of what the database state should be.
2017-09-29 15:58:36 -07:00
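The re-read strategy described in 34630b6130 can be sketched in Python. Everything here is hypothetical: `CommitUnknownResult` stands in for the commit_unknown_result error, and the `write`/`read` callables stand in for committing a transaction and reading a key. It assumes the written value is unique enough to identify the attempt.

```python
from typing import Callable, Optional


class CommitUnknownResult(Exception):
    """Stand-in for a commit whose outcome the client could not learn."""


def commit_and_confirm(write: Callable[[bytes, bytes], None],
                       read: Callable[[bytes], Optional[bytes]],
                       key: bytes, value: bytes) -> bool:
    """Return True if `value` is known to be stored at `key` afterwards."""
    try:
        write(key, value)
        return True
    except CommitUnknownResult:
        # The commit may or may not have been applied: re-read and check,
        # so the expected database state stays accurate either way.
        return read(key) == value
```

The workload can then record the write in its expected state only when the function returns True.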
Alex Miller 23945b9fea VersionStamp can co-exist with other workloads that write data to the database.
VersionStamp previously would range-read the entire database during validation.
This has the unfortunate effect of making it fail during validation if run with
any other workload that writes keys to the database.

Now, all keys written and read are done with a configurable prefix, so that it
may co-exist with a variety of other workloads.
2017-09-29 15:58:36 -07:00
Alex Miller 370a6afb80 Make VersionStamp have an option to be tolerant of data being lost. 2017-09-29 15:58:36 -07:00