Evan Tschannen
4c89f721cd
fix: do not include logRouter tags in lock results
2018-04-09 10:48:57 -07:00
Evan Tschannen
7af892f50b
first working version of non-copying recovery working with fearless configurations
2018-04-08 21:24:05 -07:00
Alex Miller
0136a01c18
Fix "Not enough physical servers available" error due to incorrect server calculation.
2018-04-05 15:13:21 -07:00
Evan Tschannen
bc938d9273
fix: storage recruitment could get stuck in a spin loop
2018-04-03 18:06:31 -07:00
Evan Tschannen
331e707684
fix: pop all tags that did not have data at the recovery version because fully popped tags may come back when pullAsyncData re-indexes the mutations
2018-03-31 16:47:56 -07:00
Evan Tschannen
96fffe2cea
fix: do not update version if the log has been stopped
2018-03-30 22:11:42 -07:00
Evan Tschannen
4fb2b99341
fix: using only one region still means we need 3 machines per datacenter, the other machines in the other datacenters just won’t be used
2018-03-30 19:26:22 -07:00
Evan Tschannen
579ba58930
pop old tags only looks at recovered tags, and checks if they are still being used
2018-03-30 19:08:01 -07:00
Evan Tschannen
8352b93f48
fix: do not reuse tags that are still in historyTags, pop historyTags past epochEnd to allow tlogs to finish recovery
...
fix: peekLocal did not properly respect end
fix: the storage server added to the end of the history vector instead of the beginning
2018-03-30 17:39:45 -07:00
Evan Tschannen
43cb63df25
fix: the collectTags bool was set incorrectly
2018-03-29 18:19:29 -07:00
Evan Tschannen
1a4ded1c99
support upgrades by merging tags associated with the different peek requests
2018-03-29 17:54:08 -07:00
Evan Tschannen
b36e08f08f
first version of non-copying recovery. Upgrades are broken, and it has not been tested using fearless configurations yet
2018-03-29 15:12:38 -07:00
Evan Tschannen
da737e1ea3
suppress the BestTeamStuck trace event
2018-03-26 18:32:32 -07:00
Evan Tschannen
82ed956c65
renamed the multi_dc configuration to three_datacenter. The old three_datacenter configuration was not a useful configuration.
2018-03-26 18:31:26 -07:00
Evan Tschannen
b95e68eb5a
fix: getDatabaseSize is really inefficient and causes slow tasks in the real world. Outside of simulation just assume the database is really large, because we only need the InvalidShardSize check in simulation
2018-03-26 17:35:11 -07:00
Alec Grieser
bb5f3ebb6d
add router to help text for storage class of fdbserver
2018-03-26 13:26:56 -07:00
Evan Tschannen
d3fb17d30a
Merge pull request #74 from bnamasivayam/client-profiling-tests
...
Client profiling tests - Part 1
2018-03-23 16:52:49 -07:00
Balachandar Namasivayam
1e719d79e9
Remove incorrect ASSERT's
...
Account for corner cases in missing chunks.
2018-03-23 15:51:56 -07:00
Evan Tschannen
5db52ab081
Merge pull request #87 from etschannen/feature-remote-logs
...
Feature remote logs
2018-03-23 12:55:17 -07:00
Evan Tschannen
7c48e1d31c
Update SimulatedCluster.actor.cpp
2018-03-23 12:54:44 -07:00
A.J. Beamon
ddc0c613ed
Merge pull request #109 from apple/release-5.2
...
Merge Release 5.2 into master
2018-03-21 09:37:56 -07:00
Clement Pang
64deb0e0a1
Address review comments.
2018-03-20 14:38:04 -07:00
Clement Pang
b46ffb4cbc
Available space should take into account both memory and disk
2018-03-20 14:38:04 -07:00
Evan Tschannen
0746fe4d56
optimized tag lookups on the tlog by removing one level of vectors
2018-03-20 10:41:42 -07:00
Evan Tschannen
d8e064d8bb
fix: when a new log is recruited on a shared log, all outstanding commits need to be notified that they are stopped, because there is no longer a guarantee that their queueCommittedVersion will advance
2018-03-19 17:48:28 -07:00
Alec Grieser
551ea9c7f8
Merge remote-tracking branch 'upstream/release-5.2' into master-release-5.2-merge
2018-03-19 12:34:50 -07:00
yichic
ede5cab192
Merge pull request #89 from yichic/share-log-mutations-5.2
...
Share log mutations 5.2
2018-03-19 12:01:26 -07:00
Yichi Chiang
1f2602d2b3
Fix all review comments
2018-03-19 11:33:33 -07:00
Yichi Chiang
d6559b144f
Share log mutations between backups and DRs which have the same backup range
2018-03-19 11:32:50 -07:00
Evan Tschannen
54be14000d
do not deserialize tags
2018-03-17 11:24:18 -07:00
Evan Tschannen
4dcef08260
optimized the log router to use a vector instead of a map for tag data
2018-03-17 11:08:37 -07:00
Evan Tschannen
9c8cb445d6
optimized the tlog to use a vector for tags instead of a map
2018-03-17 10:36:19 -07:00
Evan Tschannen
fecfea0f7d
fix: messages vector was not cleared
2018-03-17 10:24:44 -07:00
Balachandar Namasivayam
9e3e3c8561
Add some sanity checks to deserialized data.
2018-03-16 18:45:25 -07:00
Yichi Chiang
f12c1d811c
Fix all review comments
2018-03-16 18:09:23 -07:00
Yichi Chiang
26b93ff920
Share log mutations between backups and DRs which have the same backup range
2018-03-16 18:09:23 -07:00
Evan Tschannen
ccd70fd005
The tlog uses the tags embedded in the message instead of a separate vector of locations
...
optimized remote tlog committing to avoid re-serializing the message
2018-03-16 16:47:05 -07:00
Evan Tschannen
820382ea68
optimized the log router commit path to avoid re-serializing the data
2018-03-16 11:40:21 -07:00
Evan Tschannen
a42205eb8e
test running with only one region
2018-03-15 15:40:58 -07:00
Balachandar Namasivayam
89d7cc1093
Minor Bug fixes...
2018-03-15 11:00:47 -07:00
Evan Tschannen
82fb6424ec
fix: storage recruitment could get stuck in a spin loop
2018-03-15 11:00:44 -07:00
Evan Tschannen
65b532658f
added support for single region configurations
2018-03-15 10:59:30 -07:00
Alec Grieser
0853fcb052
switch to using zu for some size_t variables in printf
2018-03-14 18:07:05 -07:00
Evan Tschannen
59723f51f8
fix: continue to attempt to lock logs until remote logs are recovered, this is so that remote logs get locked and readers know they will not have any more data
...
do not throttle trace events in simulation
2018-03-14 12:39:55 -07:00
Balachandar Namasivayam
856d2a0a9d
Add correctness tests for Client transaction profiling data format. It also includes format check across upgrades.
2018-03-14 12:39:50 -07:00
Alec Grieser
70a05c1a9b
fix some compiler whinges
2018-03-13 15:00:16 -07:00
Evan Tschannen
2e741057d4
use references instead of copying regionInfo
2018-03-13 12:59:07 -07:00
Evan Tschannen
f6a22c1035
fix: the recovery actor was holding a copy of the tlogInterface after the tlog was removed
2018-03-12 16:56:34 -07:00
Evan Tschannen
72d56a700c
fix: do not serialize a tlog interface without a unique id
2018-03-10 09:52:09 -08:00
Evan Tschannen
c74211bd92
fix: merge problem
2018-03-09 16:52:37 -08:00
Evan Tschannen
3abf4d7fdf
Merge branch 'master' into feature-remote-logs
2018-03-09 14:50:04 -08:00
Evan Tschannen
91bb8faa45
Merge commit 'f773b9460d31d31b7d421860fc647936f31aa1fa'
...
# Conflicts:
# tests/fast/SidebandWithStatus.txt
# tests/rare/LargeApiCorrectnessStatus.txt
# tests/slow/DDBalanceAndRemoveStatus.txt
2018-03-09 14:47:03 -08:00
Evan Tschannen
28ea983487
Merge branch 'release-5.1' into release-5.2
...
# Conflicts:
# flow/Trace.cpp
# versions.target
2018-03-09 14:40:31 -08:00
A.J. Beamon
bb9f51bb5c
Don't try to extract attributes from the program start trace events if they couldn't be collected.
2018-03-09 11:55:57 -08:00
Evan Tschannen
cf6dd1437b
suppress spammy trace events
2018-03-09 10:16:34 -08:00
Evan Tschannen
ae7d8e90b2
Merge branch 'release-5.1' of github.com:apple/foundationdb into release-5.1
2018-03-09 09:56:09 -08:00
Evan Tschannen
5390af8be4
suppress spammy logs
2018-03-09 09:40:36 -08:00
A.J. Beamon
1bf9f0ec6b
Merge pull request #54 from etschannen/release-5.1
...
fix: new cluster controllers should not consider anything failed unti…
2018-03-09 09:28:21 -08:00
Evan Tschannen
f9625f5b2f
fix: new cluster controllers should not consider anything failed until they have time to get failure monitoring updates
...
fix: storage and log class machines wait 100MS before attempting to become the cluster controller
2018-03-08 18:08:41 -08:00
Balachandar Namasivayam
e7309a3535
Add trace events to print the ranges in ConsistencyCheck.
2018-03-08 13:53:59 -08:00
Evan Tschannen
cf9d02cdbd
Merge pull request #48 from apple/release-5.2
...
Merge release-5.2 into master
2018-03-08 13:21:26 -08:00
A.J. Beamon
2c92ef8ff8
Merge pull request #47 from apple/release-5.1
...
Merge Release 5.1 into Release 5.2
2018-03-08 13:18:45 -08:00
A.J. Beamon
73cec8abad
Merge branch 'release-5.1' of github.com:apple/foundationdb into release-5.1
2018-03-08 11:47:44 -08:00
Balachandar Namasivayam
4f58bca66a
Simple refactor of code...
2018-03-08 11:34:25 -08:00
Balachandar Namasivayam
1c1a497ea2
Refactor getKeyServers to be more readable.
...
Fix possible memory corruption by returning KeyRange instead of KeyRangeRef in getKeyServers.
Simplify getMasterProxies on DatabaseContext class.
2018-03-08 11:34:18 -08:00
Balachandar Namasivayam
03a40354e3
Having 1000 as the limit for GetKeyServerLocationsRequest sometimes generates large packet warnings. Reduce it to 100.
...
Fix the bug where some of the key server shards may not be fetched.
2018-03-08 11:34:11 -08:00
A.J. Beamon
fdcaf473ae
Don't pass a copy of the StorageServerInterface to storageServerRollbackRebooter. This prevents a situation where the storage server has terminated but the request streams are left open until the underlying KV-store gets closed.
2018-03-08 11:14:24 -08:00
Evan Tschannen
fa7eaea7cf
fix: shards affected by team failure did not properly handle separate teams for the remote and primary data centers
2018-03-08 10:50:05 -08:00
bnamasivayam
f838bc077e
Merge pull request #36 from ajbeamon/release-5.2
...
Set the address in consistency check processes…
2018-03-07 15:00:14 -08:00
Evan Tschannen
9d4cdc828b
fix: inactive cursors are still useful if their version is larger than the current version
2018-03-07 12:54:53 -08:00
Evan Tschannen
68606c7984
fix: sim2 logic for when a kill is safe was incorrect
2018-03-06 18:38:05 -08:00
Alec Grieser
2a2ac56529
Merge pull request #22 from alecgrieser/37844532-expose-append-if-fits
...
Expose APPEND_IF_FITS to clients
2018-03-06 16:31:36 -08:00
Evan Tschannen
8c88041608
fix: we must commit to the number of log routers we are going to use when recruiting the primary, because it determines the number of log router tags that will be attached to mutations
2018-03-06 16:31:21 -08:00
A.J. Beamon
232bd496bf
Set the address in consistency check processes in the same way we set it for clients so that it shows up in trace logs. Disallow specifying a public address for consistency check processes.
2018-03-06 15:40:04 -08:00
A.J. Beamon
7f8f655b9c
Revert "Fix build errors"
...
This reverts commit 51804f0504.
2018-03-06 10:28:39 -08:00
A.J. Beamon
f2c804e14f
Reverting changes from merge of master into release-5.2 (b25810711c). Note that we never intend to release master into release-5.2, but if we did we would need to revert this commit.
2018-03-06 10:15:04 -08:00
Evan Tschannen
1194e3a361
added region-based configuration to support a large variety of fearless setups. Currently only 1 primary 1 remote setups are allowed.
2018-03-05 19:27:46 -08:00
Balachandar Namasivayam
aea1f7ba21
Add tests for Client Transaction Profiling correctness
2018-03-05 18:55:23 -08:00
Balachandar Namasivayam
51804f0504
Fix build errors
2018-03-05 15:18:14 -08:00
A.J. Beamon
b25810711c
Merge branch 'master' into release-5.2
2018-03-05 10:32:57 -08:00
Balachandar Namasivayam
8ae640c062
Addressed review comments.
2018-03-02 17:56:49 -08:00
Alec Grieser
218b7a41e2
add APPEND_IF_FITS to workload and remove guard ; add command to vexillographer
2018-03-02 17:43:39 -08:00
Balachandar Namasivayam
11df1aeabf
Add new api to get shared tlogs id and address
2018-03-02 16:50:30 -08:00
Evan Tschannen
470f5c01f3
changed remoteDcId to a vector of ids, to support future configurations where there are multiple remote databases
2018-02-26 17:09:09 -08:00
Evan Tschannen
a67296b373
do not test fearless configurations to merge with master
2018-02-26 13:31:06 -08:00
Evan Tschannen
8e966fdf9c
simulated cluster tests all configurations. Still needs to randomize the remote and satellite replication, along with the number of remote tlogs, log routers, and satellite tlogs
2018-02-26 13:15:44 -08:00
Evan Tschannen
e3c6b66240
fix: do not commit more data after being stopped
...
fix: prioritize dc locality above exclusion to prevent being stuck after excluding all machines in a data center
2018-02-26 13:13:37 -08:00
Evan Tschannen
37a6a81634
Merge commit '7f6fc3e039c911cd84b8540f7f799fc38a1c1822' into feature-remote-logs
...
# Conflicts:
# fdbserver/workloads/RestartRecovery.actor.cpp
2018-02-23 12:33:28 -08:00
Evan Tschannen
cfcf98cffc
fix: log router tags were not stored at a best location
2018-02-23 12:26:19 -08:00
Evan Tschannen
a49e43000e
fix: did not peek from log routers correctly
2018-02-22 16:13:56 -08:00
Evan Tschannen
719bb5bd0c
Merge pull request #4 from bnamasivayam/getKeyServers-refactor
...
Having 1000 as the limit for Limit for GetKeyServerLocationsRequest s…
2018-02-22 12:39:48 -08:00
Balachandar Namasivayam
2fe2b522d5
Simple refactor of code...
2018-02-22 12:38:14 -08:00
Alec Grieser
e1162e9238
Merge remote-tracking branch 'upstream/release-5.1'
2018-02-22 11:16:12 -08:00
Balachandar Namasivayam
e2030db5a8
Refactor getKeyServers to be more readable.
...
Fix possible memory corruption by returning KeyRange instead of KeyRangeRef in getKeyServers.
Simplify getMasterProxies on DatabaseContext class.
2018-02-21 17:11:50 -08:00
Evan Tschannen
2aa273df96
addStorageServer was advancing tags too much because of read errors
2018-02-21 17:05:39 -08:00
Evan Tschannen
310f56d98a
fix: tlogs was resized incorrectly
2018-02-21 15:28:02 -08:00
Evan Tschannen
ddb484143c
fix: do not peek from remote logs if they are not fully recovered
2018-02-21 14:06:44 -08:00
Alec Grieser
0bae9880f1
remove trailing whitespace from our copyright headers ; fixed formatting of python setup.py
2018-02-21 10:25:11 -08:00
Balachandar Namasivayam
6218934c7b
Having 1000 as the limit for GetKeyServerLocationsRequest sometimes generates large packet warnings. Reduce it to 100.
...
Fix the bug where some of the key server shards may not be fetched.
2018-02-20 17:41:34 -08:00
Evan Tschannen
1dc6a8d4bd
fix: the tlog can peek from log systems that have been recovered even if it does not match its recoverFrom set
2018-02-20 14:50:13 -08:00
Alec Grieser
aadc06de99
Merge remote-tracking branch 'upstream/release-5.1'
2018-02-20 14:28:29 -08:00
Evan Tschannen
9ea963ddd6
fix: the master did not detect core state changes if it changed while writing
...
fix: do not attempt to use three_data_hall when in a fearless deployment
fix: log router tags are ephemeral and can be cleared after every recovery
2018-02-19 16:49:57 -08:00
Evan Tschannen
1b5628d2c5
testing a single configured fearless setup in simulated cluster
...
consolidated simulation connection disablers into one call in the tester
automatically reconfigure from a fearless setup in simulation
2018-02-18 12:59:43 -08:00
Evan Tschannen
31b89a638f
added satellite_none and remote_none options to unconfigure from a fearless setup
...
fix: log_router configuration was broken
2018-02-17 13:51:17 -08:00
Stephen Atherton
54fc81b260
Improved backup error reporting in backup status. The most recent error for each error type is reported along with how long ago the error occurred, and errors are divided into two categories based on whether or not they occurred since the most recent backup progress.
2018-02-16 19:38:31 -08:00
Evan Tschannen
dc93759e15
suppressed trace events that are spammy
2018-02-16 16:01:19 -08:00
Evan Tschannen
cb25564d38
simulated cluster supports fearless configurations
...
removed unused simulation variables
run the simulation with only 1 coordinator most of the time, since we protect the coordinator from being killed, and protecting too many things is bad for simulation
2018-02-15 18:32:39 -08:00
Evan Tschannen
ad19d3926b
fix: make sure there are enough machines in each dc to support triple replication for the configure workload
2018-02-14 17:06:22 -08:00
Evan Tschannen
5303962af6
re-enabled configure database and remove servers safely, even though they do not work with fearless
2018-02-14 16:07:23 -08:00
Evan Tschannen
ead3892e77
fix: prevent fast spin for future version
2018-02-14 15:16:18 -08:00
Evan Tschannen
110309272c
fix: do not count a server as read-write unless it has a recent version, because it could have been readable a long time ago
2018-02-14 15:09:19 -08:00
A.J. Beamon
3300c2efed
Enable slow task profiling in the consistency check processes.
2018-02-14 09:50:12 -08:00
Evan Tschannen
d2b0c07558
storage servers continue to attempt to pop old tags after the log system updates
2018-02-13 18:34:13 -08:00
Evan Tschannen
1fedcba890
fix: do not use log router tags when configured without remote logs
...
fix: data distribution tracks undesired storage servers
re-enabled consistency check
2018-02-13 17:01:34 -08:00
Evan Tschannen
a52ea4eb78
restored 5.1 functionality of simulated cluster. Will test assigned primary and remote data centers. Does not test remote replication or satellite logs
2018-02-10 13:27:51 -08:00
Evan Tschannen
42405c78a5
Merge commit '4038bd2fd968d88861f2cebd442ce511724816cb' into feature-remote-logs
...
# Conflicts:
# fdbserver/ClusterController.actor.cpp
# fdbserver/Knobs.cpp
2018-02-10 12:08:52 -08:00
Evan Tschannen
fbadcc6eea
changing a storage server’s tag must be the first mutations applied in a version, because privatized mutations applied earlier in the same version will use the old tag
2018-02-09 18:21:29 -08:00
Evan Tschannen
c7b3be5b19
re-enabled better master exists
...
the cluster controller can choose a better data center for itself and let the workers know where the next cluster controller should be recruited
2018-02-09 16:48:55 -08:00
Stephen Atherton
acb876d520
Merge branch 'release-5.1'
2018-02-07 15:11:52 -08:00
Evan Tschannen
d0caffd339
fix: knob was set to incorrect value
2018-02-06 18:11:45 -08:00
Stephen Atherton
3a49211c44
Merge branch 'release-5.1'
2018-02-06 13:58:35 -08:00
Stephen Atherton
7de40413d5
Merge branch 'release-5.1' of github.com:apple/foundationdb into release-5.1
2018-02-06 13:44:25 -08:00
Stephen Atherton
0792d5e3dd
Fix: last restorable version for a backup tag name (a separate value from the latest restorable version for a configured backup) was not being updated.
...
Fix: backup blob speed was sometimes an error because the JSON $sum merge operator did not support mixed numeric types.
Fix: JSON merge operator handling was squashing errors in some cases, which was generally obscuring the backup speed metric issue.
Cleaned up some of the JSON object merging logic.
Improved error messages in JSON merge operators. Added JSON merge operator tests for mixed numeric math and improved readability of test output.
2018-02-06 13:44:04 -08:00
Evan Tschannen
b7dde88029
fix: the cluster controller did not consider the master sharing the same process as the cluster controller as bad in all needed locations
...
waited too long for good recruitment locations, which would add too much time to recoveries of clusters that do not use machine classes
2018-02-06 11:30:05 -08:00
Evan Tschannen
63a9f2aed6
fix: history tags were being incorrectly popped
...
fix: history tags were not cleared when a storage server was removed
2018-02-03 12:20:18 -08:00
Evan Tschannen
ebd94bb654
removed a separately configurable storage team size for the remote data center, because it did not make sense
...
fix: the master did not monitor for the failure of remote logs
stop merge attempts when a data center is failed
fixed a variety of other problems with data distribution when a data center is failed
2018-02-02 11:46:04 -08:00
Evan Tschannen
766964ff48
fix: dest tags were not repopulated when the tag cache was cleared
2018-01-31 17:35:48 -08:00
A.J. Beamon
0c601d6f85
Purge past version references
2018-01-31 12:05:41 -08:00
Evan Tschannen
6b54d56ca7
gracefully exit if attempting to upgrade from 4.X versions
2018-01-30 17:10:50 -08:00
Evan Tschannen
b48d8ce96d
getTeam will return an unhealthy exact match if all teams are unhealthy. Resubmit relocation requests once healthy teams are available
2018-01-30 17:00:51 -08:00
Evan Tschannen
4160765fa1
added a buggify which reboots a server immediately after it has changed its locality
2018-01-29 18:21:28 -08:00
Evan Tschannen
af97a512f5
to support more complicated policies in the future for determining the best location for a tag within a set of tlogs, use an integer instead of a bool
2018-01-29 17:48:18 -08:00
Evan Tschannen
497bc3fe83
fix: txsTag needs to choose the same best location as 5.X version of the software
2018-01-29 17:09:35 -08:00
Evan Tschannen
29c5d4ad3d
upgrades from 5.X mostly supported, still some remaining correctness problems
2018-01-28 11:52:54 -08:00
Evan Tschannen
79d94214a4
Merge commit 'f4ffc9752b5ec66ac47f5f684a5d8be06a7eae6e' into feature-remote-logs
2018-01-25 10:12:06 -08:00
A.J. Beamon
2744646090
Merge branch 'release-5.0' into release-5.1
2018-01-22 11:57:58 -08:00
A.J. Beamon
188562ccbc
fix: Status should create its DatabaseConfiguration using fromKeyValues(). This makes sure that various state is correctly set if not specified in the configuration.
2018-01-22 11:40:08 -08:00
Evan Tschannen
66b2218989
added tlog support for upgrading from 5.X clusters. Does not support upgrading from 4.X or earlier. Untested, storage servers still need the ability to change their tag.
2018-01-21 12:21:46 -08:00
Evan Tschannen
698ef4117e
Merge branch 'master' into feature-remote-logs
2018-01-20 10:34:30 -08:00
Evan Tschannen
b5eba4f13a
fix: do not check for desired data centers if they have not been set
2018-01-20 10:28:59 -08:00
A.J. Beamon
35b91bfb55
Add back (in different form) some ratekeeper trace events when a storage server or log doesn't respond. Add actualTPS (named TPSBasis) to RkUpdate.
2018-01-18 14:51:38 -08:00
Evan Tschannen
b78e0a362a
fix: do not pause when running multiple backup tests simultaneously
2018-01-18 12:24:33 -08:00
Evan Tschannen
2e46ee3dba
fix: getTeam works when there are no teams
2018-01-17 17:49:13 -08:00
Evan Tschannen
264dc44dfa
fixed many more bugs associated with running without remote logs
2018-01-17 17:03:17 -08:00
Stephen Atherton
93b34a945f
Major usability and performance improvements to backup management. Backup descriptions now calculate and display timestamps using TimeKeeper data (if given a cluster) and restorability of snapshots. Expire now requires a --force option to leave a backup unrestorable or unrestorable after a given point in time, specified by version or timestamp. BackupContainerFilesystem now maintains metadata on key version boundaries in order to avoid large list operations for describe and expire operations. Blob parallel recursive list operations can now take a path (aka prefix) filter function. New describe and expire options are available in fdbbackup.
2018-01-17 04:09:43 -08:00
Evan Tschannen
8f58bdd1cd
fixed a large number of problems related to running without remote logs
2018-01-16 18:12:40 -08:00
Evan Tschannen
316e200a0c
fix: compilation errors after merge
2018-01-16 10:48:50 -08:00
Evan Tschannen
21482a45e1
Merge branch 'master' into feature-remote-logs
...
# Conflicts:
# fdbserver/DBCoreState.h
# fdbserver/LogSystem.h
# fdbserver/LogSystemPeekCursor.actor.cpp
# fdbserver/TLogServer.actor.cpp
2018-01-14 13:40:24 -08:00
Evan Tschannen
645dc5ead6
warmRange needs to get a read version occasionally to prevent it from overwhelming the proxy
...
quietDatabase waits for all data distribution to be completely finished so that databases are cached in a cleaner state
2018-01-14 12:50:52 -08:00
Evan Tschannen
be643d6937
fix: the tlog did not cancel recovery properly when stopped
2018-01-12 17:18:14 -08:00
Evan Tschannen
3915d6825c
we need to check the server list at a higher priority, because if we do not notice a storage server interface change for a long period of time, we will mark it as failed
2018-01-12 12:51:07 -08:00
Evan Tschannen
de119f192d
fixed a priority inversion where the tlog would prefer to copy data from the previous generation rather than make data durable (leading to being ratekeeper controlled)
2018-01-11 16:09:49 -08:00
Evan Tschannen
29ebb19388
Merge branch 'release-5.0' into release-5.1
2018-01-11 15:43:37 -08:00
Evan Tschannen
22e5a0b257
formatting
2018-01-11 14:44:09 -08:00
Evan Tschannen
173a8de3ed
DBCoreState supports upgrades from 3.0 versions
2018-01-11 14:39:51 -08:00
A.J. Beamon
2f5073d00f
Some visual studio project cleanup.
2018-01-10 10:07:18 -08:00
Evan Tschannen
022df3b91b
backup and restore sometimes took too long in simulation
2018-01-09 17:26:42 -08:00
Evan Tschannen
645f68212b
make timekeeper priority system immediate
2018-01-08 18:21:00 -08:00
Evan Tschannen
370e8a9903
fix: split metrics could fail an assert in a very rare scenario
2018-01-08 18:20:22 -08:00
Evan Tschannen
9630deba3a
fixed a number of bugs related to running fearless without remote logs
2018-01-08 12:04:19 -08:00
Evan Tschannen
d3116fb336
masterRecoveryDuration is only a sevWarnAlways outside of simulation
2018-01-07 15:37:45 -08:00
Evan Tschannen
4e8bc273b3
added a version of getKeyRangeLocations that checks for endpoint failures
...
fix: did not add the cluster controller to id_used in all cases
removed obsolete fixmes
2018-01-07 15:32:43 -08:00
Evan Tschannen
30710f7493
syncLogId was not necessary
2018-01-06 14:52:39 -08:00
Evan Tschannen
3ec45d38a0
Merge branch 'master' into feature-remote-logs
...
# Conflicts:
# tests/fast/SidebandWithStatus.txt
# tests/rare/LargeApiCorrectnessStatus.txt
# tests/slow/DDBalanceAndRemoveStatus.txt
2018-01-06 13:54:45 -08:00
Evan Tschannen
10c3fc165e
fix: after recovering from disk, only allow peeking data that was fully recovered
2018-01-06 13:49:13 -08:00
Stephen Atherton
b86f68ceb8
Added new test that combines atomic backup/restore. Added randomization to delays in AtomicRestore workload.
2018-01-05 14:43:21 -08:00
Evan Tschannen
63751fb0e2
fix: remote logs are not in the log system until the recovery is complete so they cannot be used to determine if this is the correct log system to recover from
2018-01-05 14:15:25 -08:00
Evan Tschannen
5ac4f73978
Merge branch 'release-5.1' into feature-remote-logs
...
# Conflicts:
# fdbclient/NativeAPI.actor.cpp
# fdbrpc/Locality.h
# fdbrpc/simulator.h
# fdbserver/ApplyMetadataMutation.h
# fdbserver/ClusterController.actor.cpp
# fdbserver/LogSystemPeekCursor.actor.cpp
# fdbserver/MasterProxyServer.actor.cpp
# fdbserver/SimulatedCluster.actor.cpp
# fdbserver/TLogServer.actor.cpp
# fdbserver/TagPartitionedLogSystem.actor.cpp
# fdbserver/WorkerInterface.h
# fdbserver/masterserver.actor.cpp
# flow/Net2.actor.cpp
# tests/fast/SidebandWithStatus.txt
# tests/rare/LargeApiCorrectnessStatus.txt
# tests/slow/DDBalanceAndRemoveStatus.txt
2018-01-05 11:33:42 -08:00
A.J. Beamon
5015119115
Generalize the message that gets displayed in status if a cluster file's contents are incorrect.
2018-01-05 10:29:47 -08:00
Evan Tschannen
e11f461cbd
fix: better master exists needs to check master fitness before tlogs or proxies because that is the order of recruitment
2018-01-04 15:19:46 -08:00
Evan Tschannen
f8f1c48d83
sometimes test pausing backups
2018-01-04 11:40:08 -08:00
Evan Tschannen
f2c4beed9f
fix: tlogFitness did not consider it better to have one tlog of a better fitness
...
fix: checkStable was not used in all places in better master exists
fix: we need to call checkOutstanding on worker registration in all cases
fix: in case persistentData is keyValueStoreMemory, we need to make sure it is fully recovered before writing to it
2018-01-04 11:33:02 -08:00
Evan Tschannen
6d5dd9bd27
fix: we cannot pipeline disk queue commits until after the first commit is successful
2018-01-02 13:30:27 -08:00
Evan Tschannen
86958cb08d
Merge pull request #226 from cie/fix-taskBucket-unblockFuture
...
Modify TaskBucketCorrectness to support chain and multiple tasks
2017-12-20 18:00:54 -08:00
Yichi Chiang
91e5abeaa6
Modify TaskBucketCorrectness to support chain and multiple tasks
2017-12-20 17:02:49 -08:00
Alex Miller
f70e3b9fe8
Add or change a bunch of comments to provide descriptions of function contracts.
...
This cleans up a bit of the VersionStamp DR work I did, and leaves hints and
advice for anyone who will be touching mutation applying code in the future.
2017-12-20 16:57:14 -08:00
Evan Tschannen
982f0dcb1e
Merge pull request #222 from cie/alexmiller/drtimefix2
...
Fix yet another VersionStamp DR issue.
2017-12-20 15:09:23 -08:00
Alex Miller
b5a6bc0ab7
Fix VersionStamp problems by instead adding a COMMIT_ON_FIRST_PROXY transaction option.
...
Simulation identified the fact that we can violate the
VersionStamps-are-always-increasing promise via the following series of events:
1. On proxy 0, dumpData adds commit requests to proxy 0's commit promise stream
2. To any proxy, a client submits the first transaction of abortBackup, which stops further dumpData calls on proxy 0.
3. To any proxy that is not proxy 0, submit a transaction that checks if it needs to upgrade the destination version.
4. The transaction from (3) is committed
5. Transactions from (1) are committed
This is possible because the dumpData transactions have no read conflict
ranges, and thus it's impossible to make them abort due to "conflicting"
transactions. There's also no promise that if client C sends a commit to proxy
A, and later a client D sends a commit to proxy B, that B must log its commit
after A. (We only promise that if C is told it was committed before D is told
it was committed, then A committed before B.)
There was a failed attempt to fix this problem. We tried to add read conflict
ranges to dumpData transactions so that they could be aborted by "conflicting"
transactions. However, this failed because this now means that dumpData
transactions require conflict resolution, and the stale read version that they
use can cause them to be aborted with a transaction_too_old error.
(Transactions that don't have read conflict ranges will never return
transaction_too_old, because with no reads, the read snapshot version is
effectively meaningless.) This was never previously possible, so the existing
code doesn't retry commits, and to make things more complicated, the dumpData
commits must be applied in order. This would require either adding
dependencies to transactions (if A is going to commit then B must also be/have
committed), which would be complicated, or submitting transactions with a fixed
read version, and replaying the failed commits with a higher read version once
we get a transaction_too_old error, which would unacceptably slow down the
maximum throughput of dumpData.
Thus, we've instead elected to add a special transaction option that bypasses
proxy load balancing for commits, and always commits against proxy 0. We can
know for certain that after the transaction from (2) is committed, all of the
dumpData transactions that will be committed have been added to the commit
promise stream on proxy 0. Thus, if we enqueue another transaction against
proxy 0, we can know that it will be placed into the promise stream after all
of the dumpData transactions, thus providing the semantics that we require: no
dumpData transaction can commit after the destination version upgrade
transaction.
2017-12-20 15:04:04 -08:00
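The ordering argument in the COMMIT_ON_FIRST_PROXY commit message above can be illustrated with a toy model. This is only a sketch in Python, not FoundationDB code: `ToyProxy`, `commit_order`, and the transaction labels are hypothetical names, and the one assumption carried over from the message is that a proxy's commit promise stream is processed in the order transactions were added to it.

```python
from collections import deque


class ToyProxy:
    """Hypothetical stand-in for a proxy whose commit stream is strictly FIFO."""

    def __init__(self):
        self.stream = deque()

    def enqueue(self, txn):
        # Add a transaction to the end of the commit promise stream.
        self.stream.append(txn)

    def drain(self):
        # Commit everything in the order it was enqueued.
        committed = []
        while self.stream:
            committed.append(self.stream.popleft())
        return committed


def commit_order(num_dump_data):
    """Return the commit order when every transaction goes through proxy 0."""
    proxy0 = ToyProxy()
    # Step (1): dumpData adds its commit requests to proxy 0's stream.
    for i in range(num_dump_data):
        proxy0.enqueue(f"dumpData-{i}")
    # Step (2) has committed by now, so no further dumpData requests arrive.
    # Step (3), pinned to proxy 0, is therefore enqueued behind all of them.
    proxy0.enqueue("upgradeDestinationVersion")
    return proxy0.drain()
```

In this model the upgrade transaction is always last, which is the property the message asks for. If it were sent to a different proxy, nothing in the model would order it relative to the dumpData commits.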
Stephen Atherton
e0d9cea008
Merge branch 'master' into continuous-backup
...
# Conflicts:
# fdbclient/FileBackupAgent.actor.cpp
# fdbrpc/BlobStore.actor.cpp
2017-12-19 23:02:14 -08:00
Alex Miller
c7dbd31a1e
Refactoring: Create a common prefixRange and do UID->Key once in backup.
2017-12-19 17:17:50 -08:00
Alex Miller
1488c12c18
Simulation will return an error and print if any non-suppressed SevError events were logged.
...
This means that loops like `seed=1; while ./fdbserver -r simulation -s $seed;
do seed=$(($seed+1)); done` can be used to find an example of an often failing test. This
also means joshua will report ExitCode errors on anything that has a SevError
in the log.
As a part of this, we also implicitly downgrade any injected errors to SevWarnAlways.
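
The seed-search loop quoted above can be sketched as a small reusable shell
function. This is only a sketch: `find_failing_seed` and `fake_sim` are
hypothetical names introduced here for illustration, and `fake_sim` stands in
for `./fdbserver -r simulation -s $seed` so the snippet runs without a build.

```shell
#!/bin/sh
# Hypothetical stand-in for `./fdbserver -r simulation -s "$seed"`:
# exits non-zero only for seed 4, as if that run had logged a SevError.
fake_sim() {
    [ "$1" -ne 4 ]
}

# Run the given command with increasing seeds and print the first seed
# for which it exits non-zero. Usage: find_failing_seed CMD START_SEED
find_failing_seed() {
    cmd=$1
    seed=$2
    while "$cmd" "$seed"; do
        seed=$((seed + 1))
    done
    echo "$seed"
}

find_failing_seed fake_sim 1
```

With a real build, the command passed in would be a wrapper that invokes
`./fdbserver -r simulation -s "$1"`. The loop never terminates if no seed
fails, so an upper bound on the seed may be worth adding.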
2017-12-19 17:17:50 -08:00
Stephen Atherton
e28641886d
TraceEvent improvements. Minor bug fix: restore log writing tasks didn't have the log file endVersion, but it's only for logging purposes.
2017-12-19 15:27:04 -08:00
Evan Tschannen
a5601877b3
fix: valgrind issue with destruction ordering
2017-12-18 15:31:59 -08:00
Evan Tschannen
1dc9eceb6d
optimize GetKeyLocationRequests on the proxy so they only require a single map lookup, instead of doing 3 + (3* [number of ranges]) lookups
2017-12-15 20:13:44 -08:00
Stephen Atherton
33f9f1a95c
Added SnapshotDispatch task for writing snapshots in random order over a specified period of time and adapting speed to a growing or shrinking database. TaskBucket now supports scheduling tasks. TaskFuture now correctly recognizes multiple tasks in its callback space. TaskBucket extendTimeout() now supports specifying the new timeout version. Submitting a backup now requires a snapshot duration.
2017-12-14 01:44:38 -08:00
Evan Tschannen
7ce93426ed
fix: connection disabler in removeServerSafely needs to run for the whole test to avoid getting stuck on include all
2017-12-12 18:38:57 -08:00
Alec Grieser
4495a19299
Merge pull request #220 from cie/alexmiller/flowprofcircus
...
Add class restrictions to CpuProfiler, and fix metric crash.
2017-12-11 14:13:22 -08:00
Evan Tschannen
73a0a07eac
clients ask for key location information directly from the proxy, instead of reading it from the database
2017-12-09 16:10:22 -08:00
Alex Miller
48660e9ce5
Add class restrictions to CpuProfiler, and fix metric crash.
...
This change largely refactors away the old meaning of the value given to
flow_profiler, which was the number of machines that we'd be profiling, and
instead replaces it with the classes of processes to profile for the duration
of the test. Most importantly, this means that one can profile in circus with
a configuration that has "ssd" in it, and the circus run will still complete
(as long as the argument isn't "storage").
Also add some other fixes I had to the same file, which conditionally change
the name of the metric we're looking for to comply with what's actually
written.
2017-12-07 19:28:29 -08:00
Stephen Atherton
abb2dd1ebc
Merge pull request #214 from cie/alexmiller/fallocate
...
Use fallocate to zero ranges instead of writing zeroes
2017-12-06 13:47:40 -08:00
Evan Tschannen
5a947212ed
fix: ensure all prior commits have completed before returning that a commit has committed from the disk queue
2017-12-06 12:31:07 -08:00
Stephen Atherton
f8e89a40ac
Bug fixes: take(1) is incorrect usage of FlowLock.
2017-12-04 10:25:47 -08:00
Evan Tschannen
49dac11a5f
added a SevWarnAlways for when a disk queue file grows larger than 20GB
2017-12-01 15:05:17 -08:00
Evan Tschannen
482ac38ca6
added knobs so that the client failure monitoring update rate and the server failure monitoring update rate are separate knobs
2017-12-01 13:04:32 -08:00
Evan Tschannen
c3918d892a
do not use bandwidth splitting on the keyServer shard; lots of sets and clears to this shard generally means you do not want to create additional data distribution work
2017-11-30 18:28:16 -08:00
Alex Miller
196258080b
Refactor zeroing a chunk of a file from DiskQueue into IAsyncFile.
...
If we're going to do the work to provide more optimized ways to zero files,
then I'd feel better with this being in a more common place, so that any other
zero-ers are likely to reuse it. It also makes testing easier/more obvious.
Also, because it's needed for correctness, fix the aligned_alloc for OSX, which
wasn't aligned, and use an actually aligned allocation function.
2017-11-30 17:57:55 -08:00
Alex Miller
c7a120c59d
Rename IAsyncFile::incrementalDelete -> IAsyncFileSystem::incrementalDeleteFile.
...
`deleteFile` existed in IAsyncFileSystem, so an incremental delete function
seems to belong more as a virtual method on IAsyncFileSystem than a static
method on IAsyncFile, and the naming should match.
As long as we're here, change IAsyncFile to declare a virtual destructor, so
that it has good and proper C++ behavior. I presume this is what was vaguely
intended by the default constructor definition that previously existed?
2017-11-30 17:19:10 -08:00
Evan Tschannen
7f72aa7de5
fix: a storage server does not ever need to rollback before a version restored from disk
2017-11-30 11:19:43 -08:00
Evan Tschannen
e5a682948c
Merge pull request #212 from cie/check-cluster-controller-desired-class
...
Check cluster controller using desired process class in consistency c…
2017-11-29 15:57:51 -08:00
Yichi Chiang
8ba0eaebff
Check cluster controller using desired process class in consistency check
2017-11-29 15:09:23 -08:00