Commit Graph

587 Commits

Author SHA1 Message Date
Evan Tschannen 4c89f721cd fix: do not include logRouter tags in lock results 2018-04-09 10:48:57 -07:00
Evan Tschannen 7af892f50b first working version of non-copying recovery working with fearless configurations 2018-04-08 21:24:05 -07:00
Alex Miller 0136a01c18 Fix "Not enough physical servers available" error due to incorrect server calculation. 2018-04-05 15:13:21 -07:00
Evan Tschannen bc938d9273 fix: storage recruitment could get stuck in a spin loop 2018-04-03 18:06:31 -07:00
Evan Tschannen 331e707684 fix: pop all tags that did not have data at the recovery version because fully popped tags may come back when pullAsyncData re-indexes the mutations 2018-03-31 16:47:56 -07:00
Evan Tschannen 96fffe2cea fix: do not update version if the log has been stopped 2018-03-30 22:11:42 -07:00
Evan Tschannen 4fb2b99341 fix: using only one region still means we need 3 machines per datacenter, the other machines in the other datacenters just won’t be used 2018-03-30 19:26:22 -07:00
Evan Tschannen 579ba58930 pop old tags only looks at recovered tags, and checks if they are still being used 2018-03-30 19:08:01 -07:00
Evan Tschannen 8352b93f48 fix: do not reuse tags that are still in historyTags, pop historyTags past epochEnd to allow tlogs to finish recovery
fix: peekLocal did not properly respect end
fix: the storage server added to the end of the history vector instead of the beginning
2018-03-30 17:39:45 -07:00
Evan Tschannen 43cb63df25 fix: the collectTags bool was set incorrectly 2018-03-29 18:19:29 -07:00
Evan Tschannen 1a4ded1c99 support upgrades by merging tags associated with the different peek requests 2018-03-29 17:54:08 -07:00
Evan Tschannen b36e08f08f first version of non-copying recovery. Upgrades are broken, and it has not been tested using fearless configurations yet 2018-03-29 15:12:38 -07:00
Evan Tschannen da737e1ea3 suppress the BestTeamStuck trace event 2018-03-26 18:32:32 -07:00
Evan Tschannen 82ed956c65 renamed the multi_dc configuration to three_datacenter. The old three_datacenter configuration was not a useful configuration. 2018-03-26 18:31:26 -07:00
Evan Tschannen b95e68eb5a fix: getDatabaseSize is really inefficient and causes slow tasks in the real world. Outside of simulation just assume the database is really large, because we only need the InvalidShardSize check in simulation 2018-03-26 17:35:11 -07:00
Alec Grieser bb5f3ebb6d add router to help text for storage class of fdbserver 2018-03-26 13:26:56 -07:00
Evan Tschannen d3fb17d30a Merge pull request #74 from bnamasivayam/client-profiling-tests
Client profiling tests - Part 1
2018-03-23 16:52:49 -07:00
Balachandar Namasivayam 1e719d79e9 Remove incorrect ASSERTs
Account for corner cases in missing chunks.
2018-03-23 15:51:56 -07:00
Evan Tschannen 5db52ab081 Merge pull request #87 from etschannen/feature-remote-logs
Feature remote logs
2018-03-23 12:55:17 -07:00
Evan Tschannen 7c48e1d31c Update SimulatedCluster.actor.cpp 2018-03-23 12:54:44 -07:00
A.J. Beamon ddc0c613ed Merge pull request #109 from apple/release-5.2
Merge Release 5.2 into master
2018-03-21 09:37:56 -07:00
Clement Pang 64deb0e0a1 Address review comments. 2018-03-20 14:38:04 -07:00
Clement Pang b46ffb4cbc Available space should take into account both memory and disk 2018-03-20 14:38:04 -07:00
Evan Tschannen 0746fe4d56 optimized tag lookups on the tlog by removing one level of vectors 2018-03-20 10:41:42 -07:00
Evan Tschannen d8e064d8bb fix: when a new log is recruited on a shared log, all outstanding commits need to be notified that they are stopped, because there is no longer a guarantee that their queueCommittedVersion will advance 2018-03-19 17:48:28 -07:00
Alec Grieser 551ea9c7f8 Merge remote-tracking branch 'upstream/release-5.2' into master-release-5.2-merge 2018-03-19 12:34:50 -07:00
yichic ede5cab192 Merge pull request #89 from yichic/share-log-mutations-5.2
Share log mutations 5.2
2018-03-19 12:01:26 -07:00
Yichi Chiang 1f2602d2b3 Fix all review comments 2018-03-19 11:33:33 -07:00
Yichi Chiang d6559b144f Share log mutations between backups and DRs which have the same backup range 2018-03-19 11:32:50 -07:00
Evan Tschannen 54be14000d do not deserialize tags 2018-03-17 11:24:18 -07:00
Evan Tschannen 4dcef08260 optimized the log router to use a vector instead of a map for tag data 2018-03-17 11:08:37 -07:00
Evan Tschannen 9c8cb445d6 optimized the tlog to use a vector for tags instead of a map 2018-03-17 10:36:19 -07:00
Evan Tschannen fecfea0f7d fix: messages vector was not cleared 2018-03-17 10:24:44 -07:00
Balachandar Namasivayam 9e3e3c8561 Add some sanity checks to deserialized data. 2018-03-16 18:45:25 -07:00
Yichi Chiang f12c1d811c Fix all review comments 2018-03-16 18:09:23 -07:00
Yichi Chiang 26b93ff920 Share log mutations between backups and DRs which have the same backup range 2018-03-16 18:09:23 -07:00
Evan Tschannen ccd70fd005 The tlog uses the tags embedded in the message instead of a separate vector of locations
optimized remote tlog committing to avoid re-serializing the message
2018-03-16 16:47:05 -07:00
Evan Tschannen 820382ea68 optimized the log router commit path to avoid re-serializing the data 2018-03-16 11:40:21 -07:00
Evan Tschannen a42205eb8e test running with only one region 2018-03-15 15:40:58 -07:00
Balachandar Namasivayam 89d7cc1093 Minor Bug fixes... 2018-03-15 11:00:47 -07:00
Evan Tschannen 82fb6424ec fix: storage recruitment could get stuck in a spin loop 2018-03-15 11:00:44 -07:00
Evan Tschannen 65b532658f added support for single region configurations 2018-03-15 10:59:30 -07:00
Alec Grieser 0853fcb052 switch to using zu for some size_t variables in printf 2018-03-14 18:07:05 -07:00
Evan Tschannen 59723f51f8 fix: continue to attempt to lock logs until remote logs are recovered, this is so that remote logs get locked and readers know they will not have any more data
do not throttle trace events in simulation
2018-03-14 12:39:55 -07:00
Balachandar Namasivayam 856d2a0a9d Add correctness tests for Client transaction profiling data format. It also includes format check across upgrades. 2018-03-14 12:39:50 -07:00
Alec Grieser 70a05c1a9b fix some compiler whinges 2018-03-13 15:00:16 -07:00
Evan Tschannen 2e741057d4 use references instead of copying regionInfo 2018-03-13 12:59:07 -07:00
Evan Tschannen f6a22c1035 fix: the recovery actor was holding a copy of the tlogInterface after the tlog was removed 2018-03-12 16:56:34 -07:00
Evan Tschannen 72d56a700c fix: do not serialize a tlog interface without a unique id 2018-03-10 09:52:09 -08:00
Evan Tschannen c74211bd92 fix: merge problem 2018-03-09 16:52:37 -08:00
Evan Tschannen 3abf4d7fdf Merge branch 'master' into feature-remote-logs 2018-03-09 14:50:04 -08:00
Evan Tschannen 91bb8faa45 Merge commit 'f773b9460d31d31b7d421860fc647936f31aa1fa'
# Conflicts:
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2018-03-09 14:47:03 -08:00
Evan Tschannen 28ea983487 Merge branch 'release-5.1' into release-5.2
# Conflicts:
#	flow/Trace.cpp
#	versions.target
2018-03-09 14:40:31 -08:00
A.J. Beamon bb9f51bb5c Don't try to extract attributes from the program start trace events if they couldn't be collected. 2018-03-09 11:55:57 -08:00
Evan Tschannen cf6dd1437b suppress spammy trace events 2018-03-09 10:16:34 -08:00
Evan Tschannen ae7d8e90b2 Merge branch 'release-5.1' of github.com:apple/foundationdb into release-5.1 2018-03-09 09:56:09 -08:00
Evan Tschannen 5390af8be4 suppress spammy logs 2018-03-09 09:40:36 -08:00
A.J. Beamon 1bf9f0ec6b Merge pull request #54 from etschannen/release-5.1
fix: new cluster controllers should not consider anything failed unti…
2018-03-09 09:28:21 -08:00
Evan Tschannen f9625f5b2f fix: new cluster controllers should not consider anything failed until they have time to get failure monitoring updates
fix: storage and log class machines wait 100MS before attempting to become the cluster controller
2018-03-08 18:08:41 -08:00
Balachandar Namasivayam e7309a3535 Add trace events to print the ranges in ConsistencyCheck. 2018-03-08 13:53:59 -08:00
Evan Tschannen cf9d02cdbd Merge pull request #48 from apple/release-5.2
Merge release-5.2 into master
2018-03-08 13:21:26 -08:00
A.J. Beamon 2c92ef8ff8 Merge pull request #47 from apple/release-5.1
Merge Release 5.1 into Release 5.2
2018-03-08 13:18:45 -08:00
A.J. Beamon 73cec8abad Merge branch 'release-5.1' of github.com:apple/foundationdb into release-5.1 2018-03-08 11:47:44 -08:00
Balachandar Namasivayam 4f58bca66a Simple refactor of code... 2018-03-08 11:34:25 -08:00
Balachandar Namasivayam 1c1a497ea2 Refactor getKeyServers to be more readable.
Fix possible memory corruption by returning KeyRange instead of KeyRangeRef in getKeyServers.
Simplify getMasterProxies on DatabaseContext class.
2018-03-08 11:34:18 -08:00
Balachandar Namasivayam 03a40354e3 Having 1000 as the limit for GetKeyServerLocationsRequest sometimes generates large packet warnings. Reduce it to 100.
Fix the bug where some of the key server shards may not be fetched.
2018-03-08 11:34:11 -08:00
A.J. Beamon fdcaf473ae Don't pass a copy of the StorageServerInterface to storageServerRollbackRebooter. This prevents a situation where the storage server has terminated but the request streams are left open until the underlying KV-store gets closed. 2018-03-08 11:14:24 -08:00
Evan Tschannen fa7eaea7cf fix: shards affected by team failure did not properly handle separate teams for the remote and primary data centers 2018-03-08 10:50:05 -08:00
bnamasivayam f838bc077e Merge pull request #36 from ajbeamon/release-5.2
Set the address in consistency check processes…
2018-03-07 15:00:14 -08:00
Evan Tschannen 9d4cdc828b fix: inactive cursors are still useful if their version is larger than the current version 2018-03-07 12:54:53 -08:00
Evan Tschannen 68606c7984 fix: sim2 logic for when a kill is safe was incorrect 2018-03-06 18:38:05 -08:00
Alec Grieser 2a2ac56529 Merge pull request #22 from alecgrieser/37844532-expose-append-if-fits
Expose APPEND_IF_FITS to clients
2018-03-06 16:31:36 -08:00
Evan Tschannen 8c88041608 fix: we must commit to the number of log routers we are going to use when recruiting the primary, because it determines the number of log router tags that will be attached to mutations 2018-03-06 16:31:21 -08:00
A.J. Beamon 232bd496bf Set the address in consistency check processes in the same way we set it for clients so that it shows up in trace logs. Disallow specifying a public address for consistency check processes. 2018-03-06 15:40:04 -08:00
A.J. Beamon 7f8f655b9c Revert "Fix build errors"
This reverts commit 51804f0504.
2018-03-06 10:28:39 -08:00
A.J. Beamon f2c804e14f Reverting changes from merge of master into release-5.2 (b25810711c). Note that we never intend to release master into release-5.2, but if we did we would need to revert this commit. 2018-03-06 10:15:04 -08:00
Evan Tschannen 1194e3a361 added region-based configuration to support a large variety of fearless setups. Currently only 1 primary 1 remote setups are allowed. 2018-03-05 19:27:46 -08:00
Balachandar Namasivayam aea1f7ba21 Add tests for Client Transaction Profiling correctness 2018-03-05 18:55:23 -08:00
Balachandar Namasivayam 51804f0504 Fix build errors 2018-03-05 15:18:14 -08:00
A.J. Beamon b25810711c Merge branch 'master' into release-5.2 2018-03-05 10:32:57 -08:00
Balachandar Namasivayam 8ae640c062 Addressed review comments. 2018-03-02 17:56:49 -08:00
Alec Grieser 218b7a41e2 add APPEND_IF_FITS to workload and remove guard ; add command to vexillographer 2018-03-02 17:43:39 -08:00
Balachandar Namasivayam 11df1aeabf Add new api to get shared tlogs id and address 2018-03-02 16:50:30 -08:00
Evan Tschannen 470f5c01f3 changed remoteDcId to a vector of ids, to support future configurations where there are multiple remote databases 2018-02-26 17:09:09 -08:00
Evan Tschannen a67296b373 do not test fearless configurations to merge with master 2018-02-26 13:31:06 -08:00
Evan Tschannen 8e966fdf9c simulated cluster tests all configurations. Still needs to randomize the remote and satellite replication, along with the number of remote tlogs, log routers, and satellite tlogs 2018-02-26 13:15:44 -08:00
Evan Tschannen e3c6b66240 fix: do not commit more data after being stopped
fix: prioritize dc locality above exclusion to prevent being stuck after excluding all machines in a data center
2018-02-26 13:13:37 -08:00
Evan Tschannen 37a6a81634 Merge commit '7f6fc3e039c911cd84b8540f7f799fc38a1c1822' into feature-remote-logs
# Conflicts:
#	fdbserver/workloads/RestartRecovery.actor.cpp
2018-02-23 12:33:28 -08:00
Evan Tschannen cfcf98cffc fix: log router tags were not stored at a best location 2018-02-23 12:26:19 -08:00
Evan Tschannen a49e43000e fix: did not peek from log routers correctly 2018-02-22 16:13:56 -08:00
Evan Tschannen 719bb5bd0c Merge pull request #4 from bnamasivayam/getKeyServers-refactor
Having 1000 as the limit for Limit for GetKeyServerLocationsRequest s…
2018-02-22 12:39:48 -08:00
Balachandar Namasivayam 2fe2b522d5 Simple refactor of code... 2018-02-22 12:38:14 -08:00
Alec Grieser e1162e9238 Merge remote-tracking branch 'upstream/release-5.1' 2018-02-22 11:16:12 -08:00
Balachandar Namasivayam e2030db5a8 Refactor getKeyServers to be more readable.
Fix possible memory corruption by returning KeyRange instead of KeyRangeRef in getKeyServers.
Simplify getMasterProxies on DatabaseContext class.
2018-02-21 17:11:50 -08:00
Evan Tschannen 2aa273df96 addStorageServer was advancing tags too much because of read errors 2018-02-21 17:05:39 -08:00
Evan Tschannen 310f56d98a fix: tlogs was resized incorrectly 2018-02-21 15:28:02 -08:00
Evan Tschannen ddb484143c fix: do not peek from remote logs if they are not fully recovered 2018-02-21 14:06:44 -08:00
Alec Grieser 0bae9880f1 remove trailing whitespace from our copyright headers ; fixed formatting of python setup.py 2018-02-21 10:25:11 -08:00
Balachandar Namasivayam 6218934c7b Having 1000 as the limit for GetKeyServerLocationsRequest sometimes generates large packet warnings. Reduce it to 100.
Fix the bug where some of the key server shards may not be fetched.
2018-02-20 17:41:34 -08:00
Evan Tschannen 1dc6a8d4bd fix: the tlog can peek from log systems that have been recovered even if it does not match its recoverFrom set 2018-02-20 14:50:13 -08:00
Alec Grieser aadc06de99 Merge remote-tracking branch 'upstream/release-5.1' 2018-02-20 14:28:29 -08:00
Evan Tschannen 9ea963ddd6 fix: the master did not detect core state changes if it changed while writing
fix: do not attempt to use three_data_hall when in a fearless deployment
fix: log router tags are ephemeral and can be cleared after every recovery
2018-02-19 16:49:57 -08:00
Evan Tschannen 1b5628d2c5 testing a single configured fearless setup in simulated cluster
consolidated simulation connection disablers into one call in the tester
automatically reconfigure from a fearless setup in simulation
2018-02-18 12:59:43 -08:00
Evan Tschannen 31b89a638f added satellite_none and remote_none options to unconfigure from a fearless setup
fix: log_router configuration was broken
2018-02-17 13:51:17 -08:00
Stephen Atherton 54fc81b260 Improved backup error reporting in backup status. The most recent error for each error type is reported along with how long ago the error occurred, and errors are divided into two categories based on whether or not they occurred since the most recent backup progress. 2018-02-16 19:38:31 -08:00
Evan Tschannen dc93759e15 suppressed trace events that are spammy 2018-02-16 16:01:19 -08:00
Evan Tschannen cb25564d38 simulated cluster supports fearless configurations
removed unused simulation variables
run the simulation with only 1 coordinator most of the time, since we protect the coordinator from being killed, and protecting too many things is bad for simulation
2018-02-15 18:32:39 -08:00
Evan Tschannen ad19d3926b fix: make sure there are enough machines in each dc to support triple replication for the configure workload 2018-02-14 17:06:22 -08:00
Evan Tschannen 5303962af6 re-enabled configure database and remove servers safely, even though they do not work with fearless 2018-02-14 16:07:23 -08:00
Evan Tschannen ead3892e77 fix: prevent fast spin for future version 2018-02-14 15:16:18 -08:00
Evan Tschannen 110309272c fix: do not count a server as read-write unless it has a recent version, because it could have been readable a long time ago 2018-02-14 15:09:19 -08:00
A.J. Beamon 3300c2efed Enable slow task profiling in the consistency check processes. 2018-02-14 09:50:12 -08:00
Evan Tschannen d2b0c07558 storage servers continue to attempt to pop old tags after the log system updates 2018-02-13 18:34:13 -08:00
Evan Tschannen 1fedcba890 fix: do not use log router tags when configured without remote logs
fix: data distribution tracks undesired storage servers
re-enabled consistency check
2018-02-13 17:01:34 -08:00
Evan Tschannen a52ea4eb78 restored 5.1 functionality of simulated cluster. Will test assigned primary and remote data centers. Does not test remote replication or satellite logs 2018-02-10 13:27:51 -08:00
Evan Tschannen 42405c78a5 Merge commit '4038bd2fd968d88861f2cebd442ce511724816cb' into feature-remote-logs
# Conflicts:
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/Knobs.cpp
2018-02-10 12:08:52 -08:00
Evan Tschannen fbadcc6eea changing a storage server’s tag must be the first mutations applied in a version, because privatized mutations applied earlier in the same version will use the old tag 2018-02-09 18:21:29 -08:00
Evan Tschannen c7b3be5b19 re-enabled better master exists
the cluster controller can choose a better data center for itself and let the workers know where the next cluster controller should be recruited
2018-02-09 16:48:55 -08:00
Stephen Atherton acb876d520 Merge branch 'release-5.1' 2018-02-07 15:11:52 -08:00
Evan Tschannen d0caffd339 fix: knob was set to incorrect value 2018-02-06 18:11:45 -08:00
Stephen Atherton 3a49211c44 Merge branch 'release-5.1' 2018-02-06 13:58:35 -08:00
Stephen Atherton 7de40413d5 Merge branch 'release-5.1' of github.com:apple/foundationdb into release-5.1 2018-02-06 13:44:25 -08:00
Stephen Atherton 0792d5e3dd Fix: last restorable version for a backup tag name (a separate value from the latest restorable version for a configured backup) was not being updated.
Fix: backup blob speed was sometimes an error because the JSON $sum merge operator did not support mixed numeric types.
Fix: JSON merge operator handling was squashing errors in some cases, which was generally obscuring the backup speed metric issue.
Cleaned up some of the JSON object merging logic.
Improved error messages in JSON merge operators.  Added JSON merge operator tests for mixed numeric math and improved readability of test output.
2018-02-06 13:44:04 -08:00
Evan Tschannen b7dde88029 fix: the cluster controller did not consider the master sharing the same process as the cluster controller as bad in all needed locations
waited too long for good recruitment locations, which would add too much time to recoveries of clusters that do not use machine classes
2018-02-06 11:30:05 -08:00
Evan Tschannen 63a9f2aed6 fix: history tags were being incorrectly popped
fix: history tags were not cleared when a storage server was removed
2018-02-03 12:20:18 -08:00
Evan Tschannen ebd94bb654 removed a separately configurable storage team size for the remote data center, because it did not make sense
fix: the master did not monitor for the failure of remote logs
stop merge attempts when a data center is failed
fixed a variety of other problems with data distribution when a data center is failed
2018-02-02 11:46:04 -08:00
Evan Tschannen 766964ff48 fix: dest tags were not repopulated when the tag cache was cleared 2018-01-31 17:35:48 -08:00
A.J. Beamon 0c601d6f85 Purge past version references 2018-01-31 12:05:41 -08:00
Evan Tschannen 6b54d56ca7 gracefully exit if attempting to upgrade from 4.X versions 2018-01-30 17:10:50 -08:00
Evan Tschannen b48d8ce96d getTeam will return an unhealthy exact match if all teams are unhealthy. Resubmit relocation requests once healthy teams are available 2018-01-30 17:00:51 -08:00
Evan Tschannen 4160765fa1 added a buggify which reboots a server immediately after it has changed its locality 2018-01-29 18:21:28 -08:00
Evan Tschannen af97a512f5 to support more complicated policies in the future for determining the best location for a tag within a set of tlogs, use an integer instead of a bool 2018-01-29 17:48:18 -08:00
Evan Tschannen 497bc3fe83 fix: txsTag needs to choose the same best location as 5.X version of the software 2018-01-29 17:09:35 -08:00
Evan Tschannen 29c5d4ad3d upgrades from 5.X mostly supported, still some remaining correctness problems 2018-01-28 11:52:54 -08:00
Evan Tschannen 79d94214a4 Merge commit 'f4ffc9752b5ec66ac47f5f684a5d8be06a7eae6e' into feature-remote-logs 2018-01-25 10:12:06 -08:00
A.J. Beamon 2744646090 Merge branch 'release-5.0' into release-5.1 2018-01-22 11:57:58 -08:00
A.J. Beamon 188562ccbc fix: Status should create its DatabaseConfiguration using fromKeyValues(). This makes sure that various state is correctly set if not specified in the configuration. 2018-01-22 11:40:08 -08:00
Evan Tschannen 66b2218989 added tlog support for upgrading from 5.X clusters. Does not support upgrading from 4.X or earlier. Untested, storage servers still need the ability to change their tag. 2018-01-21 12:21:46 -08:00
Evan Tschannen 698ef4117e Merge branch 'master' into feature-remote-logs 2018-01-20 10:34:30 -08:00
Evan Tschannen b5eba4f13a fix: do not check for desired data centers if they have not been set 2018-01-20 10:28:59 -08:00
A.J. Beamon 35b91bfb55 Add back (in different form) some ratekeeper trace events when a storage server or log doesn't respond. Add actualTPS (named TPSBasis) to RkUpdate. 2018-01-18 14:51:38 -08:00
Evan Tschannen b78e0a362a fix: do not pause when running multiple backup tests simultaneously 2018-01-18 12:24:33 -08:00
Evan Tschannen 2e46ee3dba fix: getTeam works when there are no teams 2018-01-17 17:49:13 -08:00
Evan Tschannen 264dc44dfa fixed many more bugs associated with running without remote logs 2018-01-17 17:03:17 -08:00
Stephen Atherton 93b34a945f Major usability and performance improvements to backup management. Backup descriptions now calculate and display timestamps using TimeKeeper data (if given a cluster) and restorability of snapshots. Expire now requires a --force option to leave a backup unrestorable or unrestorable after a given point in time, specified by version or timestamp. BackupContainerFilesystem now maintains metadata on key version boundaries in order to avoid large list operations for describe and expire operations. Blob parallel recursive list operations can now take a path (aka prefix) filter function. New describe and expire options are available in fdbbackup. 2018-01-17 04:09:43 -08:00
Evan Tschannen 8f58bdd1cd fixed a large number of problems related to running without remote logs 2018-01-16 18:12:40 -08:00
Evan Tschannen 316e200a0c fix: compilation errors after merge 2018-01-16 10:48:50 -08:00
Evan Tschannen 21482a45e1 Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbserver/DBCoreState.h
#	fdbserver/LogSystem.h
#	fdbserver/LogSystemPeekCursor.actor.cpp
#	fdbserver/TLogServer.actor.cpp
2018-01-14 13:40:24 -08:00
Evan Tschannen 645dc5ead6 warmRange needs to get a read version occasionally to prevent it from overwhelming the proxy
quietDatabase waits for all data distribution to be completely finished so that databases are cached in a cleaner state
2018-01-14 12:50:52 -08:00
Evan Tschannen be643d6937 fix: the tlog did not cancel recovery properly when stopped 2018-01-12 17:18:14 -08:00
Evan Tschannen 3915d6825c we need to check the server list at a higher priority, because if we do not notice a storage server interface change for a long period of time, we will mark it as failed 2018-01-12 12:51:07 -08:00
Evan Tschannen de119f192d fixed a priority inversion where the tlog would prefer to copy data from the previous generation rather than make data durable (leading to being ratekeeper controlled) 2018-01-11 16:09:49 -08:00
Evan Tschannen 29ebb19388 Merge branch 'release-5.0' into release-5.1 2018-01-11 15:43:37 -08:00
Evan Tschannen 22e5a0b257 formatting 2018-01-11 14:44:09 -08:00
Evan Tschannen 173a8de3ed DBCoreState supports upgrades from 3.0 versions 2018-01-11 14:39:51 -08:00
A.J. Beamon 2f5073d00f Some visual studio project cleanup. 2018-01-10 10:07:18 -08:00
Evan Tschannen 022df3b91b backup and restore sometimes took too long in simulation 2018-01-09 17:26:42 -08:00
Evan Tschannen 645f68212b make timekeeper priority system immediate 2018-01-08 18:21:00 -08:00
Evan Tschannen 370e8a9903 fix: split metrics could fail an assert in a very rare scenario 2018-01-08 18:20:22 -08:00
Evan Tschannen 9630deba3a fixed a number of bugs related to running fearless without remote logs 2018-01-08 12:04:19 -08:00
Evan Tschannen d3116fb336 masterRecoveryDuration is only a sevWarnAlways outside of simulation 2018-01-07 15:37:45 -08:00
Evan Tschannen 4e8bc273b3 added a version of getKeyRangeLocations that checks for endpoint failures
fix: did not add the cluster controller to id_used in all cases
removed obsolete fixmes
2018-01-07 15:32:43 -08:00
Evan Tschannen 30710f7493 syncLogId was not necessary 2018-01-06 14:52:39 -08:00
Evan Tschannen 3ec45d38a0 Merge branch 'master' into feature-remote-logs
# Conflicts:
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2018-01-06 13:54:45 -08:00
Evan Tschannen 10c3fc165e fix: after recovering from disk, only allow peeking data that was fully recovered 2018-01-06 13:49:13 -08:00
Stephen Atherton b86f68ceb8 Added new test that combines atomic backup/restore. Added randomization to delays in AtomicRestore workload. 2018-01-05 14:43:21 -08:00
Evan Tschannen 63751fb0e2 fix: remote logs are not in the log system until the recovery is complete so they cannot be used to determine if this is the correct log system to recover from 2018-01-05 14:15:25 -08:00
Evan Tschannen 5ac4f73978 Merge branch 'release-5.1' into feature-remote-logs
# Conflicts:
#	fdbclient/NativeAPI.actor.cpp
#	fdbrpc/Locality.h
#	fdbrpc/simulator.h
#	fdbserver/ApplyMetadataMutation.h
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/LogSystemPeekCursor.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/SimulatedCluster.actor.cpp
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/TagPartitionedLogSystem.actor.cpp
#	fdbserver/WorkerInterface.h
#	fdbserver/masterserver.actor.cpp
#	flow/Net2.actor.cpp
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2018-01-05 11:33:42 -08:00
A.J. Beamon 5015119115 Generalize the message that gets displayed in status if a cluster file's contents are incorrect. 2018-01-05 10:29:47 -08:00
Evan Tschannen e11f461cbd fix: better master exists needs to check master fitness before tlogs or proxies because that is the order of recruitment 2018-01-04 15:19:46 -08:00
Evan Tschannen f8f1c48d83 sometimes test pausing backups 2018-01-04 11:40:08 -08:00
Evan Tschannen f2c4beed9f fix: tlogFitness did not consider it better to have one tlog of a better fitness
fix: checkStable was not used in all places in better master exists
fix: we need to call checkOutstanding on worker registration in all cases
fix: in case persistentData is keyValueStoreMemory, we need to make sure it is fully recovered before writing to it
2018-01-04 11:33:02 -08:00
Evan Tschannen 6d5dd9bd27 fix: we cannot pipeline disk queue commits until after the first commit is successful 2018-01-02 13:30:27 -08:00
Evan Tschannen 86958cb08d Merge pull request #226 from cie/fix-taskBucket-unblockFuture
Modify TaskBucketCorrectness to support chain and multiple tasks
2017-12-20 18:00:54 -08:00
Yichi Chiang 91e5abeaa6 Modify TaskBucketCorrectness to support chain and multiple tasks 2017-12-20 17:02:49 -08:00
Alex Miller f70e3b9fe8 Add or change a bunch of comments to provide descriptions of function contracts.
This cleans up a bit of the VersionStamp DR work I did, and leaves hints and
advice for anyone who will be touching mutation applying code in the future.
2017-12-20 16:57:14 -08:00
Evan Tschannen 982f0dcb1e Merge pull request #222 from cie/alexmiller/drtimefix2
Fix yet another VersionStamp DR issue.
2017-12-20 15:09:23 -08:00
Alex Miller b5a6bc0ab7 Fix VersionStamp problems by instead adding a COMMIT_ON_FIRST_PROXY transaction option.
Simulation identified the fact that we can violate the
VersionStamps-are-always-increasing promise via the following series of events:

1. On proxy 0, dumpData adds commit requests to proxy 0's commit promise stream
2. To any proxy, a client submits the first transaction of abortBackup, which stops further dumpData calls on proxy 0.
3. To any proxy that is not proxy 0, submit a transaction that checks if it needs to upgrade the destination version.
4. The transaction from (3) is committed
5. Transactions from (1) are committed

This is possible because the dumpData transactions have no read conflict
ranges, and thus it's impossible to make them abort due to "conflicting"
transactions.  There's also no promise that if client C sends a commit to proxy
A, and later a client D sends a commit to proxy B, that B must log its commit
after A.  (We only promise that if C is told it was committed before D is told
it was committed, then A committed before B.)

There was a failed attempt to fix this problem.  We tried to add read conflict
ranges to dumpData transactions so that they could be aborted by "conflicting"
transactions.  However, this failed because this now means that dumpData
transactions require conflict resolution, and the stale read version that they
use can cause them to be aborted with a transaction_too_old error.
(Transactions that don't have read conflict ranges will never return
transaction_too_old, because with no reads, the read snapshot version is
effectively meaningless.)  This was never previously possible, so the existing
code doesn't retry commits, and to make things more complicated, the dumpData
commits must be applied in order.  This would require either adding
dependencies to transactions (if A is going to commit then B must also be/have
committed), which would be complicated, or submitting transactions with a fixed
read version, and replaying the failed commits with a higher read version once
we get a transaction_too_old error, which would unacceptably slow down the
maximum throughput of dumpData.

Thus, we've instead elected to add a special transaction option that bypasses
proxy load balancing for commits, and always commits against proxy 0.  We can
know for certain that after the transaction from (2) is committed, all of the
dumpData transactions that will be committed have been added to the commit
promise stream on proxy 0.  Thus, if we enqueue another transaction against
proxy 0, we can know that it will be placed into the promise stream after all
of the dumpData transactions, thus providing the semantics that we require:  no
dumpData transaction can commit after the destination version upgrade
transaction.
2017-12-20 15:04:04 -08:00
Stephen Atherton e0d9cea008 Merge branch 'master' into continuous-backup
# Conflicts:
#	fdbclient/FileBackupAgent.actor.cpp
#	fdbrpc/BlobStore.actor.cpp
2017-12-19 23:02:14 -08:00
Alex Miller c7dbd31a1e Refactoring: Create a common prefixRange and do UID->Key once in backup. 2017-12-19 17:17:50 -08:00
Alex Miller 1488c12c18 Simulation will return an error and print if any non-suppressed SevError events were logged.
This means that loops like `seed=1; while ./fdbserver -r simulation -s $seed;
do seed=$(($seed+1)); done` can be used to find an example of an often failing test.  This
also means joshua will report ExitCode errors on anything that has a SevError
in the log.
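
The same seed search can be sketched in Python. This is a minimal
illustration, not part of the change itself: the helper names
`first_failing_seed` and `run_fdbserver` are hypothetical, the `limit` bound
is an addition (the shell loop is unbounded), and `run_fdbserver` assumes an
`./fdbserver` binary in the working directory, invoked with the same arguments
as the loop above.

```python
import subprocess


def first_failing_seed(run_seed, start=1, limit=1000):
    """Return the first seed whose run exits nonzero, or None if none fails.

    run_seed is any callable mapping a seed to a process exit code.
    limit is an assumption added here so the search always terminates.
    """
    for seed in range(start, start + limit):
        if run_seed(seed) != 0:
            return seed
    return None


def run_fdbserver(seed):
    # Same command as the shell loop; assumes ./fdbserver exists locally.
    cmd = ["./fdbserver", "-r", "simulation", "-s", str(seed)]
    return subprocess.run(cmd).returncode
```

Calling `first_failing_seed(run_fdbserver)` would then play the role of the
shell loop, stopping at the first seed whose simulation exits with an error.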

As a part of this, we also implicitly downgrade any injected errors to SevWarnAlways.
2017-12-19 17:17:50 -08:00
Stephen Atherton e28641886d TraceEvent improvements. Minor bug fix, restore log writing tasks didn't have the log file endVersion but it's only for logging purposes. 2017-12-19 15:27:04 -08:00
Evan Tschannen a5601877b3 fix: valgrind issue with destruction ordering 2017-12-18 15:31:59 -08:00
Evan Tschannen 1dc9eceb6d optimize GetKeyLocationRequests on the proxy so they only require a single map lookup, instead of doing 3 + (3* [number of ranges]) lookups 2017-12-15 20:13:44 -08:00
Stephen Atherton 33f9f1a95c Added SnapshotDispatch task for writing snapshots in random order over a specified period of time and adapting speed to a growing or shrinking database. TaskBucket now supports scheduling tasks. TaskFuture now correctly recognizes multiple tasks in its callback space. TaskBucket extendTimeout() now supports specifying the new timeout version. Submitting a backup now requires a snapshot duration. 2017-12-14 01:44:38 -08:00
Evan Tschannen 7ce93426ed fix: connection disabler in removeServerSafely needs to run for the whole test to avoid getting stuck on include all 2017-12-12 18:38:57 -08:00
Alec Grieser 4495a19299 Merge pull request #220 from cie/alexmiller/flowprofcircus
Add class restrictions to CpuProfiler, and fix metric crash.
2017-12-11 14:13:22 -08:00
Evan Tschannen 73a0a07eac clients ask for key location information directly from the proxy, instead of reading it from the database 2017-12-09 16:10:22 -08:00
Alex Miller 48660e9ce5 Add class restrictions to CpuProfiler, and fix metric crash.
This change largely refactors away the old meaning of the value given to
flow_profiler, which was the number of machines that we'd be profiling, and
instead replaces it with the classes of processes to profile for the duration
of the test.  Most importantly, this means that one can profile in circus with
a configuration that has "ssd" in it, and the circus run will still complete
(as long as the argument isn't "storage").

Also, finally add some other fixes I had for the same file, which
conditionally change the name of the metric we're looking for to comply with
what's actually written.
2017-12-07 19:28:29 -08:00
Stephen Atherton abb2dd1ebc Merge pull request #214 from cie/alexmiller/fallocate
Use fallocate to zero ranges instead of writing zeroes
2017-12-06 13:47:40 -08:00
Evan Tschannen 5a947212ed fix: ensure all prior commits have completed before returning that a commit has committed from the disk queue 2017-12-06 12:31:07 -08:00
Stephen Atherton f8e89a40ac Bug fixes, take(1) is incorrect usage of FlowLock. 2017-12-04 10:25:47 -08:00
Evan Tschannen 49dac11a5f added a SevWarnAlways for when a disk queue file grows larger than 20GB 2017-12-01 15:05:17 -08:00
Evan Tschannen 482ac38ca6 added knobs so that the client failure monitoring update rate and the server failure monitoring update rate are separate knobs 2017-12-01 13:04:32 -08:00
Evan Tschannen c3918d892a do not use bandwidth splitting on the keyServer shard, lots of sets and clears to this shard generally means you do not want to create additional data distribution work 2017-11-30 18:28:16 -08:00
Alex Miller 196258080b Refactor zeroing a chunk of a file from DiskQueue into IAsyncFile.
If we're going to do the work to provide more optimized ways to zero files,
then I'd feel better with this being in a more common place, so that any other
zero-ers are likely to reuse it.  It also makes testing easier/more obvious.

Also, because it's needed for correctness, fix the aligned_alloc for OSX, which
wasn't aligned, and use an actually aligned allocation function.
2017-11-30 17:57:55 -08:00
Alex Miller c7a120c59d Rename IAsyncFile::incrementalDelete -> IAsyncFileSystem::incrementalDeleteFile.
`deleteFile` existed in IAsyncFileSystem, so an incremental delete function
seems to belong more as a virtual method on IAsyncFileSystem than a static
method on IAsyncFile, and the naming should match.

As long as we're here, change IAsyncFile to declare a virtual destructor, so
that it has good and proper C++ behavior.  I presume this is what was vaguely
intended by the default constructor definition that previously existed?
2017-11-30 17:19:10 -08:00
Evan Tschannen 7f72aa7de5 fix: a storage server does not ever need to rollback before a version restored from disk 2017-11-30 11:19:43 -08:00
Evan Tschannen e5a682948c Merge pull request #212 from cie/check-cluster-controller-desired-class
Check cluster controller using desired process class in consistency c…
2017-11-29 15:57:51 -08:00
Yichi Chiang 8ba0eaebff Check cluster controller using desired process class in consistency check 2017-11-29 15:09:23 -08:00