Alec Grieser
aadc06de99
Merge remote-tracking branch 'upstream/release-5.1'
2018-02-20 14:28:29 -08:00
Evan Tschannen
9ea963ddd6
fix: the master did not detect core state changes if it changed while writing
...
fix: do not attempt to use three_data_hall when in a fearless deployment
fix: log router tags are ephemeral and can be cleared after every recovery
2018-02-19 16:49:57 -08:00
Evan Tschannen
1b5628d2c5
testing a single configured fearless setup in simulated cluster
...
consolidated simulation connection disablers into one call in the tester
automatically reconfigure from a fearless setup in simulation
2018-02-18 12:59:43 -08:00
Evan Tschannen
31b89a638f
added satellite_none and remote_none options to unconfigure from a fearless setup
...
fix: log_router configuration was broken
2018-02-17 13:51:17 -08:00
Stephen Atherton
54fc81b260
Improved backup error reporting in backup status. The most recent error for each error type is reported along with how long ago the error occurred, and errors are divided into two categories based on whether or not they occurred since the most recent backup progress.
2018-02-16 19:38:31 -08:00
Evan Tschannen
dc93759e15
suppressed trace events that are spammy
2018-02-16 16:01:19 -08:00
Evan Tschannen
cb25564d38
simulated cluster supports fearless configurations
...
removed unused simulation variables
run the simulation with only 1 coordinator most of the time, since we protect the coordinator from being killed, and protecting too many things is bad for simulation
2018-02-15 18:32:39 -08:00
Evan Tschannen
ad19d3926b
fix: make sure there are enough machines in each dc to support triple replication for the configure workload
2018-02-14 17:06:22 -08:00
Evan Tschannen
5303962af6
re-enabled configure database and remove servers safely, even though they do not work with fearless
2018-02-14 16:07:23 -08:00
Evan Tschannen
ead3892e77
fix: prevent fast spin for future version
2018-02-14 15:16:18 -08:00
Evan Tschannen
110309272c
fix: do not count a server as read-write unless it has a recent version, because it could have been readable a long time ago
2018-02-14 15:09:19 -08:00
A.J. Beamon
3300c2efed
Enable slow task profiling in the consistency check processes.
2018-02-14 09:50:12 -08:00
Evan Tschannen
d2b0c07558
storage servers continue to attempt to pop old tags after the log system updates
2018-02-13 18:34:13 -08:00
Evan Tschannen
1fedcba890
fix: do not use log router tags when configured without remote logs
...
fix: data distribution tracks undesired storage servers
re-enabled consistency check
2018-02-13 17:01:34 -08:00
Evan Tschannen
a52ea4eb78
restored 5.1 functionality of simulated cluster. Will test assigned primary and remote data centers. Does not test remote replication or satellite logs
2018-02-10 13:27:51 -08:00
Evan Tschannen
42405c78a5
Merge commit '4038bd2fd968d88861f2cebd442ce511724816cb' into feature-remote-logs
...
# Conflicts:
# fdbserver/ClusterController.actor.cpp
# fdbserver/Knobs.cpp
2018-02-10 12:08:52 -08:00
Evan Tschannen
fbadcc6eea
changing a storage server’s tag must be the first mutations applied in a version, because privatized mutations applied earlier in the same version will use the old tag
2018-02-09 18:21:29 -08:00
Evan Tschannen
c7b3be5b19
re-enabled better master exists
...
the cluster controller can choose a better data center for itself and let the workers know where the next cluster controller should be recruited
2018-02-09 16:48:55 -08:00
Stephen Atherton
acb876d520
Merge branch 'release-5.1'
2018-02-07 15:11:52 -08:00
Evan Tschannen
d0caffd339
fix: knob was set to incorrect value
2018-02-06 18:11:45 -08:00
Stephen Atherton
3a49211c44
Merge branch 'release-5.1'
2018-02-06 13:58:35 -08:00
Stephen Atherton
7de40413d5
Merge branch 'release-5.1' of github.com:apple/foundationdb into release-5.1
2018-02-06 13:44:25 -08:00
Stephen Atherton
0792d5e3dd
Fix: last restorable version for a backup tag name (a separate value from the latest restorable version for a configured backup) was not being updated.
...
Fix: backup blob speed was sometimes an error because the JSON $sum merge operator did not support mixed numeric types.
Fix: JSON merge operator handling was squashing errors in some cases, which was generally obscuring the backup speed metric issue.
Cleaned up some of the JSON object merging logic.
Improved error messages in JSON merge operators. Added JSON merge operator tests for mixed numeric math and improved readability of test output.
2018-02-06 13:44:04 -08:00
Evan Tschannen
b7dde88029
fix: the cluster controller did not consider the master sharing the same process as the cluster controller as bad in all needed locations
...
waited too long for good recruitment locations, which would add too much time to recoveries of clusters that do not use machine classes
2018-02-06 11:30:05 -08:00
Evan Tschannen
63a9f2aed6
fix: history tags were being incorrectly popped
...
fix: history tags were not cleared when a storage server was removed
2018-02-03 12:20:18 -08:00
Evan Tschannen
ebd94bb654
removed a separately configurable storage team size for the remote data center, because it did not make sense
...
fix: the master did not monitor for the failure of remote logs
stop merge attempts when a data center is failed
fixed a variety of other problems with data distribution when a data center is failed
2018-02-02 11:46:04 -08:00
Evan Tschannen
766964ff48
fix: dest tags were not repopulated when the tag cache was cleared
2018-01-31 17:35:48 -08:00
A.J. Beamon
0c601d6f85
Purge past version references
2018-01-31 12:05:41 -08:00
Evan Tschannen
6b54d56ca7
gracefully exit if attempting to upgrade from 4.X versions
2018-01-30 17:10:50 -08:00
Evan Tschannen
b48d8ce96d
getTeam will return an unhealthy exact match if all teams are unhealthy. Resubmit relocation requests once healthy teams are available
2018-01-30 17:00:51 -08:00
Evan Tschannen
4160765fa1
added a buggify which reboots a server immediately after it has changed its locality
2018-01-29 18:21:28 -08:00
Evan Tschannen
af97a512f5
to support more complicated policies in the future for determining the best location for a tag within a set of tlogs, use an integer instead of a bool
2018-01-29 17:48:18 -08:00
Evan Tschannen
497bc3fe83
fix: txsTag needs to choose the same best location as 5.X version of the software
2018-01-29 17:09:35 -08:00
Evan Tschannen
29c5d4ad3d
upgrades from 5.X mostly supported, still some remaining correctness problems
2018-01-28 11:52:54 -08:00
Evan Tschannen
79d94214a4
Merge commit 'f4ffc9752b5ec66ac47f5f684a5d8be06a7eae6e' into feature-remote-logs
2018-01-25 10:12:06 -08:00
A.J. Beamon
2744646090
Merge branch 'release-5.0' into release-5.1
2018-01-22 11:57:58 -08:00
A.J. Beamon
188562ccbc
fix: Status should create its DatabaseConfiguration using fromKeyValues(). This makes sure that various state is correctly set if not specified in the configuration.
2018-01-22 11:40:08 -08:00
Evan Tschannen
66b2218989
added tlog support for upgrading from 5.X clusters. Does not support upgrading from 4.X or earlier. Untested, storage servers still need the ability to change their tag.
2018-01-21 12:21:46 -08:00
Evan Tschannen
698ef4117e
Merge branch 'master' into feature-remote-logs
2018-01-20 10:34:30 -08:00
Evan Tschannen
b5eba4f13a
fix: do not check for desired data centers if they have not been set
2018-01-20 10:28:59 -08:00
A.J. Beamon
35b91bfb55
Add back (in different form) some ratekeeper trace events when a storage server or log doesn't respond. Add actualTPS (named TPSBasis) to RkUpdate.
2018-01-18 14:51:38 -08:00
Evan Tschannen
b78e0a362a
fix: do not pause when running multiple backup tests simultaneously
2018-01-18 12:24:33 -08:00
Evan Tschannen
2e46ee3dba
fix: getTeam works when there are no teams
2018-01-17 17:49:13 -08:00
Evan Tschannen
264dc44dfa
fixed many more bugs associated with running without remote logs
2018-01-17 17:03:17 -08:00
Stephen Atherton
93b34a945f
Major usability and performance improvements to backup management. Backup descriptions now calculate and display timestamps using TimeKeeper data (if given a cluster) and restorability of snapshots. Expire now requires a --force option to leave a backup unrestorable or unrestorable after a given point in time, specified by version or timestamp. BackupContainerFilesystem now maintains metadata on key version boundaries in order to avoid large list operations for describe and expire operations. Blob parallel recursive list operations can now take a path (aka prefix) filter function. New describe and expire options are available in fdbbackup.
2018-01-17 04:09:43 -08:00
Evan Tschannen
8f58bdd1cd
fixed a large number of problems related to running without remote logs
2018-01-16 18:12:40 -08:00
Evan Tschannen
316e200a0c
fix: compilation errors after merge
2018-01-16 10:48:50 -08:00
Evan Tschannen
21482a45e1
Merge branch 'master' into feature-remote-logs
...
# Conflicts:
# fdbserver/DBCoreState.h
# fdbserver/LogSystem.h
# fdbserver/LogSystemPeekCursor.actor.cpp
# fdbserver/TLogServer.actor.cpp
2018-01-14 13:40:24 -08:00
Evan Tschannen
645dc5ead6
warmRange needs to get a read version occasionally to prevent it from overwhelming the proxy
...
quietDatabase waits for all data distribution to be completely finished so that databases are cached in a cleaner state
2018-01-14 12:50:52 -08:00
Evan Tschannen
be643d6937
fix: the tlog did not cancel recovery properly when stopped
2018-01-12 17:18:14 -08:00
Evan Tschannen
3915d6825c
we need to check the server list at a higher priority, because if we do not notice a storage server interface change for a long period of time, we will mark it as failed
2018-01-12 12:51:07 -08:00
Evan Tschannen
de119f192d
fixed a priority inversion where the tlog would prefer to copy data from the previous generation rather than make data durable (leading to being ratekeeper controlled)
2018-01-11 16:09:49 -08:00
Evan Tschannen
29ebb19388
Merge branch 'release-5.0' into release-5.1
2018-01-11 15:43:37 -08:00
Evan Tschannen
22e5a0b257
formatting
2018-01-11 14:44:09 -08:00
Evan Tschannen
173a8de3ed
DBCoreState supports upgrades from 3.0 versions
2018-01-11 14:39:51 -08:00
A.J. Beamon
2f5073d00f
Some visual studio project cleanup.
2018-01-10 10:07:18 -08:00
Evan Tschannen
022df3b91b
backup and restore sometimes took too long in simulation
2018-01-09 17:26:42 -08:00
Evan Tschannen
645f68212b
make timekeeper priority system immediate
2018-01-08 18:21:00 -08:00
Evan Tschannen
370e8a9903
fix: split metrics could fail an assert in a very rare scenario
2018-01-08 18:20:22 -08:00
Evan Tschannen
9630deba3a
fixed a number of bugs related to running fearless without remote logs
2018-01-08 12:04:19 -08:00
Evan Tschannen
d3116fb336
masterRecoveryDuration is only a sevWarnAlways outside of simulation
2018-01-07 15:37:45 -08:00
Evan Tschannen
4e8bc273b3
added a version of getKeyRangeLocations that checks for endpoint failures
...
fix: did not add the cluster controller to id_used in all cases
removed obsolete fixmes
2018-01-07 15:32:43 -08:00
Evan Tschannen
30710f7493
syncLogId was not necessary
2018-01-06 14:52:39 -08:00
Evan Tschannen
3ec45d38a0
Merge branch 'master' into feature-remote-logs
...
# Conflicts:
# tests/fast/SidebandWithStatus.txt
# tests/rare/LargeApiCorrectnessStatus.txt
# tests/slow/DDBalanceAndRemoveStatus.txt
2018-01-06 13:54:45 -08:00
Evan Tschannen
10c3fc165e
fix: after recovering from disk, only allow peeking data that was fully recovered
2018-01-06 13:49:13 -08:00
Stephen Atherton
b86f68ceb8
Added new test that combines atomic backup/restore. Added randomization to delays in AtomicRestore workload.
2018-01-05 14:43:21 -08:00
Evan Tschannen
63751fb0e2
fix: remote logs are not in the log system until the recovery is complete so they cannot be used to determine if this is the correct log system to recover from
2018-01-05 14:15:25 -08:00
Evan Tschannen
5ac4f73978
Merge branch 'release-5.1' into feature-remote-logs
...
# Conflicts:
# fdbclient/NativeAPI.actor.cpp
# fdbrpc/Locality.h
# fdbrpc/simulator.h
# fdbserver/ApplyMetadataMutation.h
# fdbserver/ClusterController.actor.cpp
# fdbserver/LogSystemPeekCursor.actor.cpp
# fdbserver/MasterProxyServer.actor.cpp
# fdbserver/SimulatedCluster.actor.cpp
# fdbserver/TLogServer.actor.cpp
# fdbserver/TagPartitionedLogSystem.actor.cpp
# fdbserver/WorkerInterface.h
# fdbserver/masterserver.actor.cpp
# flow/Net2.actor.cpp
# tests/fast/SidebandWithStatus.txt
# tests/rare/LargeApiCorrectnessStatus.txt
# tests/slow/DDBalanceAndRemoveStatus.txt
2018-01-05 11:33:42 -08:00
A.J. Beamon
5015119115
Generalize the message that gets displayed in status if a cluster file's contents are incorrect.
2018-01-05 10:29:47 -08:00
Evan Tschannen
e11f461cbd
fix: better master exists needs to check master fitness before tlogs or proxies because that is the order of recruitment
2018-01-04 15:19:46 -08:00
Evan Tschannen
f8f1c48d83
sometimes test pausing backups
2018-01-04 11:40:08 -08:00
Evan Tschannen
f2c4beed9f
fix: tlogFitness did not consider it better to have one tlog of a better fitness
...
fix: checkStable was not used in all places in better master exists
fix: we need to call checkOutstanding on worker registration in all cases
fix: in case persistentData is keyValueStoreMemory, we need to make sure it is fully recovered before writing to it
2018-01-04 11:33:02 -08:00
Evan Tschannen
6d5dd9bd27
fix: we cannot pipeline disk queue commits until after the first commit is successful
2018-01-02 13:30:27 -08:00
Evan Tschannen
86958cb08d
Merge pull request #226 from cie/fix-taskBucket-unblockFuture
...
Modify TaskBucketCorrectness to support chain and multiple tasks
2017-12-20 18:00:54 -08:00
Yichi Chiang
91e5abeaa6
Modify TaskBucketCorrectness to support chain and multiple tasks
2017-12-20 17:02:49 -08:00
Alex Miller
f70e3b9fe8
Add or change a bunch of comments to provide descriptions of function contracts.
...
This cleans up a bit of the VersionStamp DR work I did, and leaves hints and
advice for anyone who will be touching mutation applying code in the future.
2017-12-20 16:57:14 -08:00
Evan Tschannen
982f0dcb1e
Merge pull request #222 from cie/alexmiller/drtimefix2
...
Fix yet another VersionStamp DR issue.
2017-12-20 15:09:23 -08:00
Alex Miller
b5a6bc0ab7
Fix VersionStamp problems by instead adding a COMMIT_ON_FIRST_PROXY transaction option.
...
Simulation identified the fact that we can violate the
VersionStamps-are-always-increasing promise via the following series of events:
1. On proxy 0, dumpData adds commit requests to proxy 0's commit promise stream
2. To any proxy, a client submits the first transaction of abortBackup, which stops further dumpData calls on proxy 0.
3. To any proxy that is not proxy 0, submit a transaction that checks if it needs to upgrade the destination version.
4. The transaction from (3) is committed
5. Transactions from (1) are committed
This is possible because the dumpData transactions have no read conflict
ranges, and thus it's impossible to make them abort due to "conflicting"
transactions. There's also no promise that if client C sends a commit to proxy
A, and later a client D sends a commit to proxy B, that B must log its commit
after A. (We only promise that if C is told it was committed before D is told
it was committed, then A committed before B.)
There was a failed attempt to fix this problem. We tried to add read conflict
ranges to dumpData transactions so that they could be aborted by "conflicting"
transactions. However, this failed because this now means that dumpData
transactions require conflict resolution, and the stale read version that they
use can cause them to be aborted with a transaction_too_old error.
(Transactions that don't have read conflict ranges will never return
transaction_too_old, because with no reads, the read snapshot version is
effectively meaningless.) This was never previously possible, so the existing
code doesn't retry commits, and to make things more complicated, the dumpData
commits must be applied in order. This would require either adding
dependencies to transactions (if A is going to commit then B must also be/have
committed), which would be complicated, or submitting transactions with a fixed
read version, and replaying the failed commits with a higher read version once
we get a transaction_too_old error, which would unacceptably slow down the
maximum throughput of dumpData.
Thus, we've instead elected to add a special transaction option that bypasses
proxy load balancing for commits, and always commits against proxy 0. We can
know for certain that after the transaction from (2) is committed, all of the
dumpData transactions that will be committed have been added to the commit
promise stream on proxy 0. Thus, if we enqueue another transaction against
proxy 0, we can know that it will be placed into the promise stream after all
of the dumpData transactions, thus providing the semantics that we require: no
dumpData transaction can commit after the destination version upgrade
transaction.
2017-12-20 15:04:04 -08:00
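The ordering argument in b5a6bc0ab7 can be illustrated with a toy model. This is a minimal sketch, not FoundationDB code: it assumes each proxy drains its own FIFO of commit requests and that commit versions are handed out in the order commits are processed. The function `commit_order` and the transaction names are hypothetical.

```python
from collections import deque
from itertools import count


def commit_order(enqueues, drain_order):
    """Toy model: each proxy drains its own FIFO; versions are assigned at commit time.

    enqueues:    list of (proxy_id, txn_name) in submission order
    drain_order: list of proxy_ids; each entry commits one queued txn on that proxy
    Returns {txn_name: commit_version}.
    """
    queues = {}
    for proxy, txn in enqueues:
        queues.setdefault(proxy, deque()).append(txn)
    versions, next_version = {}, count(1)
    for proxy in drain_order:
        versions[queues[proxy].popleft()] = next(next_version)
    return versions


# Load-balanced commit: the upgrade transaction lands on proxy 1 and may commit first.
racy = commit_order([(0, "dump1"), (0, "dump2"), (1, "upgrade")], [1, 0, 0])
# Pinned to proxy 0: the upgrade sits behind every dumpData commit in the same FIFO.
pinned = commit_order([(0, "dump1"), (0, "dump2"), (0, "upgrade")], [0, 0, 0])

print(racy["upgrade"] < racy["dump1"])      # the violation described in steps 1-5
print(pinned["upgrade"] > pinned["dump2"])  # the ordering the option is meant to give
```

In the model, no scheduling of proxy 0 can reorder entries within its own queue, which is the property the commit message relies on.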
Stephen Atherton
e0d9cea008
Merge branch 'master' into continuous-backup
...
# Conflicts:
# fdbclient/FileBackupAgent.actor.cpp
# fdbrpc/BlobStore.actor.cpp
2017-12-19 23:02:14 -08:00
Alex Miller
c7dbd31a1e
Refactoring: Create a common prefixRange and do UID->Key once in backup.
2017-12-19 17:17:50 -08:00
Alex Miller
1488c12c18
Simulation will return an error and print if any non-suppressed SevError events were logged.
...
This means that loops like `seed=1; while ./fdbserver -r simulation -s $seed;
do seed=$(($seed+1)); done` can be used to find an example of an often failing
test. This also means joshua will report ExitCode errors on anything that has a
SevError in the log.
As a part of this, we also implicitly downgrade any injected errors to SevWarnAlways.
2017-12-19 17:17:50 -08:00
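The seed-search loop from 1488c12c18 can also be written as a small script. This is a sketch under assumptions: the command line is the one quoted in the commit message, the `./fdbserver` path is assumed to be valid in the current directory, and the `limit` bound and both function names are hypothetical additions.

```python
import subprocess
from typing import Callable, Optional


def run_simulation(seed: int) -> int:
    """Run one simulation with the given seed and return its exit code."""
    # Flags are taken from the commit message; the binary location is an assumption.
    cmd = ["./fdbserver", "-r", "simulation", "-s", str(seed)]
    return subprocess.run(cmd).returncode


def first_failing_seed(run: Callable[[int], int] = run_simulation,
                       start: int = 1, limit: int = 1000) -> Optional[int]:
    """Return the first seed in [start, start + limit) whose run exits non-zero."""
    for seed in range(start, start + limit):
        if run(seed) != 0:
            return seed
    return None
```

Passing the runner as a parameter keeps the search logic testable without the binary; calling `first_failing_seed()` with no arguments runs the real command.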
Stephen Atherton
e28641886d
TraceEvent improvements. Minor bug fix, restore log writing tasks didn't have the log file endVersion but it's only for logging purposes.
2017-12-19 15:27:04 -08:00
Evan Tschannen
a5601877b3
fix: valgrind issue with destruction ordering
2017-12-18 15:31:59 -08:00
Evan Tschannen
1dc9eceb6d
optimize GetKeyLocationRequests on the proxy so they only require a single map lookup, instead of doing 3 + (3* [number of ranges]) lookups
2017-12-15 20:13:44 -08:00
Stephen Atherton
33f9f1a95c
Added SnapshotDispatch task for writing snapshots in random order over a specified period of time and adapting speed to a growing or shrinking database. TaskBucket now supports scheduling tasks. TaskFuture now correctly recognizes multiple tasks in its callback space. TaskBucket extendTimeout() now supports specifying the new timeout version. Submitting a backup now requires a snapshot duration.
2017-12-14 01:44:38 -08:00
Evan Tschannen
7ce93426ed
fix: connection disabler in removeServerSafely needs to run for the whole test to avoid getting stuck on include all
2017-12-12 18:38:57 -08:00
Alec Grieser
4495a19299
Merge pull request #220 from cie/alexmiller/flowprofcircus
...
Add class restrictions to CpuProfiler, and fix metric crash.
2017-12-11 14:13:22 -08:00
Evan Tschannen
73a0a07eac
clients ask for key location information directly from the proxy, instead of reading it from the database
2017-12-09 16:10:22 -08:00
Alex Miller
48660e9ce5
Add class restrictions to CpuProfiler, and fix metric crash.
...
This change largely refactors away the old meaning of the value given to
flow_profiler, which was the number of machines that we'd be profiling, and
instead replaces it with the classes of processes to profile for the duration
of the test. Most importantly, this means that one can profile in circus with
a configuration that has "ssd" in it, and the circus run will still complete
(as long as the argument isn't "storage").
And also finally add some other fixes I had to the same file to conditionally
change the name of the metric we're looking for to comply with what's actually
written.
2017-12-07 19:28:29 -08:00
Stephen Atherton
abb2dd1ebc
Merge pull request #214 from cie/alexmiller/fallocate
...
Use fallocate to zero ranges instead of writing zeroes
2017-12-06 13:47:40 -08:00
Evan Tschannen
5a947212ed
fix: ensure all prior commits have completed before returning that a commit has committed from the disk queue
2017-12-06 12:31:07 -08:00
Stephen Atherton
f8e89a40ac
Bug fixes, take(1) is incorrect usage of FlowLock.
2017-12-04 10:25:47 -08:00
Evan Tschannen
49dac11a5f
added a SevWarnAlways for when a disk queue file grows larger than 20GB
2017-12-01 15:05:17 -08:00
Evan Tschannen
482ac38ca6
added knobs so that the client failure monitoring update rate and the server failure monitoring update rate are separate knobs
2017-12-01 13:04:32 -08:00
Evan Tschannen
c3918d892a
do not use bandwidth splitting on the keyServer shard, lots of sets and clears to this shard generally means you do not want to create additional data distribution work
2017-11-30 18:28:16 -08:00
Alex Miller
196258080b
Refactor zeroing a chunk of a file from DiskQueue into IAsyncFile.
...
If we're going to do the work to provide more optimized ways to zero files,
then I'd feel better with this being in a more common place, so that any other
zero-ers are likely to reuse it. It also makes testing easier/more obvious.
Also, because it's needed for correctness, fix the aligned_alloc for OSX, which
wasn't aligned, and use an actually aligned allocation function.
2017-11-30 17:57:55 -08:00
Alex Miller
c7a120c59d
Rename IAsyncFile::incrementalDelete -> IAsyncFileSystem::incrementalDeleteFile.
...
`deleteFile` existed in IAsyncFileSystem, so an incremental delete function
seems to belong more as a virtual method on IAsyncFileSystem than a static
method on IAsyncFile, and the naming should match.
As long as we're here, change IAsyncFile to declare a virtual destructor, so
that it has good and proper C++ behavior. I presume this is what was vaguely
intended by the default constructor definition that previously existed?
2017-11-30 17:19:10 -08:00
Evan Tschannen
7f72aa7de5
fix: a storage server does not ever need to rollback before a version restored from disk
2017-11-30 11:19:43 -08:00
Evan Tschannen
e5a682948c
Merge pull request #212 from cie/check-cluster-controller-desired-class
...
Check cluster controller using desired process class in consistency c…
2017-11-29 15:57:51 -08:00
Yichi Chiang
8ba0eaebff
Check cluster controller using desired process class in consistency check
2017-11-29 15:09:23 -08:00
Evan Tschannen
8c51bc4ac4
fixed low latency tests in a way that gives us better test coverage
2017-11-28 18:20:29 -08:00
Evan Tschannen
dc624a54dc
fix: avoid flushing large queues in simulation when checking latency
2017-11-27 17:23:20 -08:00
Stephen Atherton
1b1c8e985a
Merge branch 'master' into backup-container-refactor
...
# Conflicts:
# fdbclient/FileBackupAgent.actor.cpp
2017-11-25 19:54:51 -08:00
Stephen Atherton
6695c9e6a2
Bug fixes and improvements to error handling and trace events. The most serious bug was that restore would start at the wrong version, possibly skipping early log and range files.
2017-11-25 00:46:16 -08:00
Alex Miller
f19cb3bbbd
Merge pull request #208 from cie/alexmiller/grvtfix
...
Fix the GRV performance regression
2017-11-17 15:00:44 -08:00
Yichi Chiang
d9a98aa968
Remove commented code
2017-11-16 17:25:37 -08:00
Yichi Chiang
0d5dc15ac8
Fix double recoveries
2017-11-16 16:58:55 -08:00
Alex Miller
e9412bbb11
Fix the GRV performance regression introduced by adding the policy engine to GRV calculations.
...
Construction of LocalityGroup from LocalityData is expensive, and the previous
code greatly ran afoul of that. The policy engine does a large amount of
interning of strings and building compressed maps to make the expected many
future selectReplica calls cheap. Unfortunately we don't call selectReplicas,
so much of this work is undesirable for us, and a large amount of CPU time is
spent doing this initialization work.
The new changes aggressively do the minimal LocalityGroup::add() calls
necessary, and make them as cheap as possible by removing all elements from
LocalityData that don't need to be considered by the policy.
This optimization was also applied to the PeekCursor used during recovery,
which should speed recoveries up by a small amount.
2017-11-16 16:15:52 -08:00
Evan Tschannen
ad456a939a
Merge pull request #206 from cie/change-excluded-cluster-controller
...
Change excluded cluster controller
2017-11-15 17:28:33 -08:00
Yichi Chiang
f96faf72d9
Add fullyRecoveredConfig for checking exclusions
2017-11-15 17:15:24 -08:00
Evan Tschannen
30464e943c
Merge pull request #205 from cie/cleanup-spammy-traceevents
...
Cleanup spammy traceevents
2017-11-15 12:41:37 -08:00
Evan Tschannen
e113dba0e3
added a new trace event tracking master recovery durations
2017-11-15 12:38:26 -08:00
Stephen Atherton
a77162b53d
Merge branch 'master' into backup-container-refactor
...
# Conflicts:
# fdbclient/BackupAgent.h
# fdbclient/FileBackupAgent.actor.cpp
# fdbclient/KeyBackedTypes.h
2017-11-15 08:14:47 -08:00
Stephen Atherton
3dfaf13b67
IBackupContainer has been rewritten to be a logical interface for storing, reading, deleting, expiring, and querying backup data. The details of how the data is organized or stored are now hidden from users of the interface. Both the local and blobstore containers have been rewritten, the key changes being a multi level directory structure and no more use of temporary files or pseudo-symlinks in the blob store implementation. This refactor has a large impact radius as the previous backup container was just a thin wrapper that presented a single level list of files and offered no methods for managing or interpreting the file structure so all of that logic was spread around other places in the code base. This made moving to the new blob store schema very messy, and without this refactor further changes in the future would only be worse.
...
Several backup tasks have been cleaned up / simplified because they no longer need to manage the ‘raw’ structure of the backup. The addition of IBackupFile and its finish() method simplified the log and range writer tasks. Updated BlobStoreEndpoint to support now-required bucket creation and bucket listing prefix/delimiter options for finding common prefixes. Added KeyBackedSet<T> type. Moved JSONDoc to its own header. Added platform::findFilesRecursively().
Still to do: update command line tool to use new IBackupContainer interface, fix bugs in Restore startup.
2017-11-14 23:33:17 -08:00
Yichi Chiang
df922bc973
Change excluded cluster controller
2017-11-14 13:57:37 -08:00
A.J. Beamon
bb1297c686
Remove RkServerQueueInfo and RkTLogQueueInfo trace events, since this information is more or less already logged on the storage servers and tlogs. Update the quiet database check and magnesium to use the information from the logs and storage servers.
2017-11-14 12:59:42 -08:00
A.J. Beamon
3b952efb4e
Remove events from cluster controller that get logged for roughly every worker upon recovery, master registration, etc.
2017-11-14 10:15:45 -08:00
A.J. Beamon
0fea5e9c2f
Convert client_invalid_operation errors to ASSERTs.
2017-11-13 11:38:34 -08:00
A.J. Beamon
cd085764f1
Do not automatically change a cluster file that does not match what you expect.
2017-11-10 14:12:45 -08:00
Alex Miller
311d1ca87d
A variety of fixes that collectively fix using flow profiling in circus.
...
To run, use --co=flow_profiling=-1, because reasons.
2017-11-07 13:55:16 -08:00
Evan Tschannen
706bf1e018
fix: we cannot trigger better master exists before a master is fully recovered because exclusions changed by the provisional master will not be committed until the master is fully recovered
2017-11-04 12:48:04 -07:00
Evan Tschannen
57aba0b3bc
fix: excluded servers were the same fitness as storage servers for the master role
...
fix: better master exists did not consider exclusions for master fitness
2017-11-03 17:09:14 -07:00
Yichi Chiang
42fad5efe5
Introduce cluster controller process class in circus
2017-11-03 14:22:55 -07:00
Yichi Chiang
dcc9aafab7
Merge branch 'master' of github.com:apple/foundationdb
2017-11-02 10:47:59 -07:00
Yichi Chiang
c033d8efd8
Fix typo message and remove extra TraceEvent which overwrites the expected one
2017-11-02 10:47:51 -07:00
Balachandar Namasivayam
3efaaec479
onMasterProxiesChanged was being triggered when any member of ClientDBInfo changed. Change the behavior to be triggered only when proxies field in ClientDBInfo is changed.
2017-11-01 18:29:56 -07:00
A.J. Beamon
7cf17df821
Merge branch 'master' into log-group-for-unsupported-clients
...
# Conflicts:
# flow/Net2.actor.cpp
# tests/fast/SidebandWithStatus.txt
# tests/rare/LargeApiCorrectnessStatus.txt
# tests/slow/DDBalanceAndRemoveStatus.txt
2017-11-01 11:31:02 -07:00
A.J. Beamon
31caac67dc
Rename supported_versions[x].clients to supported_versions[x].connected_clients
2017-11-01 10:41:30 -07:00
Balachandar Namasivayam
988bc0207f
Reset Client Transaction profiling parameters when the config keys are cleared.
2017-10-31 15:40:57 -07:00
Alec Grieser
5a4a5985fd
Merge branch 'release-5.0'
2017-10-30 08:31:23 -07:00
Alec Grieser
87321f5017
Merge branch 'release-4.6' into release-5.0
2017-10-30 08:31:01 -07:00
Evan Tschannen
54d82c0d92
Merge pull request #194 from cie/alexmiller/valgrind
...
Fix valgrind errors
2017-10-27 17:25:12 -07:00
Alex Miller
e0d33ef8d7
Preemptively fix profiler-related valgrind errors/straight out bugs.
...
I forgot to initialize some fields in requests.
2017-10-27 17:20:19 -07:00
Evan Tschannen
aa0c2ae317
only increase the max shard size if the shard begins in the keyServer keyspace, do not increase the minimum shard size
2017-10-27 14:22:26 -07:00
Evan Tschannen
3a4078bdda
the keyservers shards are always a fixed large size
2017-10-27 11:52:11 -07:00
Balachandar Namasivayam
cfefab18fb
Merge branch 'master' into add-new-atomic-ops
2017-10-25 18:03:34 -07:00
Balachandar Namasivayam
3d5658940a
Addressed Review Comments
2017-10-25 16:42:05 -07:00
Balachandar Namasivayam
9dd588dcce
Addressed review comments.
...
Changed naming for NewMin and NewAnd to MinV2 and AndV2
2017-10-25 14:48:05 -07:00
Evan Tschannen
d852a53ae4
Merge pull request #181 from cie/throttle-spammy-logs
...
Throttle spammy logs
2017-10-25 13:45:55 -07:00
Balachandar Namasivayam
2f6d55a52f
Add correctness tests for all atomic ops
2017-10-25 13:36:49 -07:00
Yichi Chiang
4d54a73f5b
Merge pull request #191 from cie/count-cluster-controller-role
...
Take cluster controller role into consideration when recruiting workers
2017-10-25 12:09:15 -07:00
Yichi Chiang
f39cce9b8d
Use processId instead of address for comparison
2017-10-25 11:35:29 -07:00
Yichi Chiang
5fcef911f0
Take cluster controller role into consideration when recruiting workers
2017-10-25 10:35:46 -07:00
Evan Tschannen
48901a9223
added a list of tlog IDs that are missing to status
2017-10-24 16:28:50 -07:00
Yichi Chiang
c2a117fe07
Merge pull request #189 from cie/enable-check-desired-class
...
Enable checkUsingDesiredClasses() in consistency check
2017-10-24 15:18:21 -07:00
Yichi Chiang
defdc6550d
Exclude excluded processes when getting testers
2017-10-24 15:16:34 -07:00
Evan Tschannen
df74e2a373
re-added support for non-copying tlog recovery
2017-10-24 15:09:31 -07:00
Yichi Chiang
3865c5ae0e
Enable checkUsingDesiredClasses() in consistency check
2017-10-24 12:58:54 -07:00
Balachandar Namasivayam
8c3bdc5b3b
Make atomic ops differentiate between unset and empty values.
2017-10-23 16:48:13 -07:00
Evan Tschannen
7a36fd2134
disabled a variety of simulation tests to get correctness clean
2017-10-19 15:49:54 -07:00
Evan Tschannen
e2c1e87df6
made a large number of fixes to make fearless DR correctness clean.
2017-10-19 15:36:32 -07:00
Bhaskar Muppana
360b777b78
Fail with correct error code in case of abort or discontinue of
...
non-existing backups.
2017-10-18 23:17:48 -07:00
Alec Grieser
dd6d8f3b0e
Merge branch 'master' into add-new-atomic-ops
2017-10-18 16:36:44 -07:00
Bhaskar Muppana
2007f3799f
Don't ignore TimeKeeper failures.
2017-10-18 14:31:31 -07:00
Bhaskar Muppana
314511f4d7
Fixing spaces in BackupCorrectness TraceEvents.
2017-10-18 14:27:52 -07:00
Alex Miller
7b9bc1d715
Merge pull request #170 from cie/alexmiller/flowprofile
...
Add support for profiling a running fdb cluster to fdbcli, fix security issues, and add an improved backtrace.
2017-10-16 16:51:53 -07:00
Alex Miller
f997cb9038
Add a string knob to hold the Log directory, and write profiles to it.
...
This is the combination of two small changes.
1. Add support for a string knob type.
2. Change profiles to be written to the log directory instead of the working
directory.
We have three options of where to write files: the working directory, the data
directory, and the log directory.
The working directory may be set to a non-writable location, and likely
contains the fdb binaries. Allowing these files to be overwritten would likely
not be a wise idea.
The data directory hosts our sqlite b-trees. It would also be very unfortunate
if these were ever overwritten by an unfortunate profile name.
The log directory contains logs. Out of the three, these matter the least if
they disappear or become corrupted.
Thus, we write to the log directory.
2017-10-16 16:05:02 -07:00
Alex Miller
c5fbe33df6
Disallow arbitrary paths for storing profiles.
...
Previously, one could request profiles to be stored at
"../../../../../../etc/passwd". Now we expand the paths, including symlinks,
and ensure that the target is a child of the targeted subdirectory. This was
the least convoluted way I could figure out to handle paths.
2017-10-16 16:05:02 -07:00
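A minimal sketch of the kind of path check described in c5fbe33df6, written in Python for illustration only. The function name `resolve_inside` and its parameters are hypothetical and do not come from the commit; the actual change lives in the C++ server code and may differ in detail.

```python
import os


def resolve_inside(base_dir: str, requested: str) -> str:
    """Return the expanded path for `requested` if it stays under `base_dir`.

    Hypothetical helper: raises ValueError when the expanded path is not a
    child of the base directory.
    """
    # Expand symlinks and ".." components in both paths before comparing.
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, requested))
    # The longest shared ancestor must be the base directory itself, and the
    # target must be a child of it rather than the directory itself.
    if target == base or os.path.commonpath([base, target]) != base:
        raise ValueError(f"{requested!r} escapes {base_dir!r}")
    return target
```

With this sketch, a request such as "../../../../../../etc/passwd" is rejected, because after expansion it no longer shares the base directory as an ancestor.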
Alex Miller
91a26a170c
Add toggleable profiling support to fdbserver+fdbcli.
...
This adds the fdbcli commands:
* profile list -- Lists all workers in a way that doesn't fill `kill`'s list.
* profile flow run -- Allows starting flow profiling on a set of hosts for a specified interval.
And threads through all the support for enabling and disabling profiling as an RPC.
2017-10-16 16:05:02 -07:00
Balachandar Namasivayam
312f614133
Add the new ops and AND to NON_ASSOCIATIVE_MASK.
...
In the storage server, read the entire value if the op is ByteMin or ByteMax.
2017-10-16 11:06:31 -07:00
Alec Grieser
e0be1ef1e0
Merge branch 'release-5.0'
2017-10-16 10:08:11 -07:00
Alec Grieser
432726ba2d
Merge branch 'release-4.6' into release-5.0
2017-10-16 09:54:21 -07:00
Stephen Atherton
68eccb681e
Merge pull request #173 from bmuppana/master
...
Backup log messages.
2017-10-13 18:31:53 -07:00
Evan Tschannen
215bcb8d3e
Merge pull request #157 from cie/choose-leader-on-stateless-processes
...
Catch and update processClass change from DBSource
2017-10-13 14:03:29 -07:00
Yichi Chiang
5bcdd37c0d
Move UID generation and add initialClass
2017-10-13 13:46:37 -07:00
Yichi Chiang
12edd27281
Introduce prevChangeID to CandidacyRequest and LeaderHeartbeatRequest
2017-10-12 17:11:58 -07:00
Bhaskar Muppana
d1e9d28239
Backup log messages.
2017-10-12 16:12:42 -07:00
Stephen Atherton
11517f7bfc
Merge branch 'master' into continuous-backup
...
# Conflicts:
# fdbclient/FileBackupAgent.actor.cpp
2017-10-12 11:03:23 -07:00
Alex Miller
c24b941485
Fix erroneous std::move in indexed set, and clean up addMetric users.
...
This is a follow-on to c4eb73d0. Thanks to Bala for pointing out the unchanged
std::move usage, and there appeared to not be many existing users of addMetric
anyway.
2017-10-11 17:36:51 -07:00
Balachandar Namasivayam
8e0bea2795
Update API_VERSION from 500 to 510
2017-10-11 13:49:38 -07:00
Stephen Atherton
c3d8412abb
Merge pull request #166 from cie/alexmiller/deathservice
...
Fix potential division by zero issues via RPC.
2017-10-10 16:47:38 -07:00
Evan Tschannen
ff1b49be2e
Merge branch 'master' into feature-remote-logs
...
# Conflicts:
# fdbserver/DatabaseConfiguration.cpp
2017-10-10 16:07:59 -07:00
Evan Tschannen
8feb3b8fbc
fixed conflict range workload by just disabling timeKeeper instead of the check, because it should be a more robust fix
2017-10-10 16:01:02 -07:00
Balachandar Namasivayam
eeebf10030
Modified existing behavior of MIN and AND atomic ops. The new behavior results in a 'SET' if the atomic op is performed on a non-existing key.
...
Added new atomic ops ByteMin and ByteMax that do lexicographic comparison of byte strings.
2017-10-10 13:02:22 -07:00
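The semantics described in eeebf10030 can be sketched as pure functions over an optional existing value. This is an illustrative Python model, not the storage server code: the function names are hypothetical, `None` stands for a key that does not exist, and the zero-padding or truncation of the existing value in `and_op` is an assumption about how operands of different lengths are aligned.

```python
from typing import Optional


def byte_min(existing: Optional[bytes], param: bytes) -> bytes:
    # A missing key (None) behaves like SET; an empty value b"" is a real value.
    if existing is None:
        return param
    # Python compares bytes objects lexicographically.
    return min(existing, param)


def byte_max(existing: Optional[bytes], param: bytes) -> bytes:
    if existing is None:
        return param
    return max(existing, param)


def and_op(existing: Optional[bytes], param: bytes) -> bytes:
    # New behavior: AND on a non-existing key stores the parameter as-is.
    if existing is None:
        return param
    # Assumption: the existing value is truncated or zero-padded to len(param).
    aligned = existing[: len(param)].ljust(len(param), b"\x00")
    return bytes(a & b for a, b in zip(aligned, param))
```

Modelling the missing key as `None` rather than `b""` keeps "unset" and "empty" distinguishable, which is the distinction these atomic ops rely on.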
Evan Tschannen
c8525dc3e7
timekeeper is constantly changing keys in the system keyspace, so do not report errors on key mismatches on keys in the system keyspace
2017-10-10 12:04:56 -07:00
Evan Tschannen
3d2103075d
data distribution tracks teams for each data center separately
2017-10-10 10:36:33 -07:00
Evan Tschannen
5e6eba365b
fix: always set confChange, because popVersion is not deterministic across proxies, and confChange needs to be set deterministically
2017-10-06 18:37:08 -07:00
Evan Tschannen
93b3d0e4e7
fix: toMap didn’t report logs, proxies, and resolvers
2017-10-06 15:55:50 -07:00
Evan Tschannen
15962cf079
Merge branch 'master' into feature-remote-logs
...
# Conflicts:
# fdbrpc/Locality.cpp
# fdbrpc/Locality.h
# fdbserver/ClusterController.actor.cpp
# fdbserver/ClusterRecruitmentInterface.h
# fdbserver/TLogServer.actor.cpp
# fdbserver/TagPartitionedLogSystem.actor.cpp
# fdbserver/WorkerInterface.h
# fdbserver/fdbserver.vcxproj.filters
# fdbserver/masterserver.actor.cpp
# fdbserver/worker.actor.cpp
# flow/error_definitions.h
2017-10-05 17:09:44 -07:00
Alex Miller
a21c8a820b
Move cpuProfilerRequest from WorkerInterface to ClientWorkerInterface.
...
A way to access this stream is required if we wish to be able to toggle
profiling from fdbcli. There are two ways to do this:
1. Use `monitorLeader()` to get a `ClusterControllerFullInterface`, and use
`getWorkers` from there to get a list of `WorkerInterface`s, from which we can
access cpuProfilerRequest.
2. Move cpuProfilerRequest to ClientWorkerInterface and use the existing code
in the client that can fetch a list of all `ClientWorkerInterface`s.
The split between WorkerInterface and ClientWorkerInterface appears to be
what a client might have a need to call versus what is fdbserver-internal (and
thus no client should even want to call). Thus, it seems to make more sense to
acknowledge that profiling is useful to be able to toggle from a client, and go
with option (2).
2017-10-05 14:08:28 -07:00
Yichi Chiang
3edc2824a9
Add initialClass to RegisterWorkerRequest 2
2017-10-05 11:03:25 -07:00
Yichi Chiang
05f7626e39
Add initialClass to RegisterWorkerRequest
2017-10-04 17:11:12 -07:00
Yichi Chiang
3c70df57b5
Fix cluster controller review comments
2017-10-04 15:48:55 -07:00
Alex Miller
e55cc447d2
Address code review comments.
...
* Fixed memory corruption with SystemData key constants
* Removed duplication in ClusterController
* Reworked fdbcli actions to better represent explicit vs default assignments
2017-10-04 13:36:18 -07:00
A.J. Beamon
5063793f36
Revert line ending change
2017-10-04 11:19:19 -07:00
Alex Miller
706427ee62
Fix potential division by zero issues via RPC.
...
A carefully crafted SplitMetricRequest could have caused division by zero.
It's not really great to offer Division By Zero As A Service, so let's just
return an error instead.
2017-10-03 22:11:08 -07:00
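The fix in 706427ee62 amounts to validating a request field before using it as a divisor. A minimal Python sketch of that pattern follows; `InvalidRequest`, `split_count`, and the parameter names are hypothetical and are not the fields of the real SplitMetricRequest.

```python
class InvalidRequest(Exception):
    """Raised instead of letting a malformed request reach the arithmetic."""


def split_count(total_bytes: int, bytes_per_split: int) -> int:
    """Return how many pieces of at most `bytes_per_split` cover `total_bytes`."""
    # Validate caller-controlled input before it is used as a divisor.
    if bytes_per_split <= 0:
        raise InvalidRequest("bytes_per_split must be positive")
    # Ceiling division using integer arithmetic only.
    return -(-total_bytes // bytes_per_split)
```

Returning an error to the caller keeps a malformed RPC from crashing the process that serves it.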
Evan Tschannen
3a2ddcc84a
Add destinations that are read-write to the source list, so that cancelled data movement can contribute to copying the data for the next movement.
2017-10-03 17:39:08 -07:00
Balachandar Namasivayam
0e153cdd35
Throttle Spammy logs. Three knobs are added.
...
Trace Events are sampled and cached with an expiration set. Every TraceEvent above SevDebug is checked against this cache to see if it exceeded a set threshold. If yes, then throttle the TraceEvent.
If a TraceEvent is throttled, a warning msg is logged.
2017-10-02 18:43:11 -07:00
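A rough sketch of the throttling idea in 0e153cdd35, assuming a per-event-type sliding window. The class name, the `threshold` and `window` parameters, and the exact counting are hypothetical; the commit itself describes sampling events and does not name its three knobs here, so this only illustrates the cache-with-expiration check.

```python
import time
from typing import Callable, Dict, List


class EventThrottler:
    """Allow at most `threshold` events of one type per `window` seconds."""

    def __init__(self, threshold: int, window: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.window = window
        self.clock = clock
        self._seen: Dict[str, List[float]] = {}

    def allow(self, event_type: str) -> bool:
        now = self.clock()
        # Drop cached timestamps that have expired.
        recent = [t for t in self._seen.get(event_type, [])
                  if now - t < self.window]
        allowed = len(recent) < self.threshold
        if allowed:
            recent.append(now)
        self._seen[event_type] = recent
        return allowed
```

A caller would check `allow()` before emitting an event and log a single warning when it first returns False for a given type.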
Evan Tschannen
6ea9903c82
Merge branch 'release-5.0'
...
# Conflicts:
# fdbbackup/backup.actor.cpp
# fdbserver/ClusterController.actor.cpp
# versions.target
2017-10-01 18:46:44 -07:00
Evan Tschannen
0949c4be65
Revert "Fixed problem with master being recruited on excluded servers"
...
This reverts commit 1f7b624734a8ad6e896dd3f01f9cdf334ca62486.
2017-10-01 16:30:19 -07:00
Evan Tschannen
696d432462
Revert "fix: excluded servers are worst fit for master rather than never assign (so that we can recover if every process has been excluded)"
...
This reverts commit 83b2ce68c8e1a29fc1559598cc38d3ef7eb46101.
2017-10-01 16:29:32 -07:00
Evan Tschannen
0dde15f1d2
fix: excluded servers are worst fit for master rather than never assign (so that we can recover if every process has been excluded)
...
fix: better master exists did not use exclusions because the configuration was reset
2017-10-01 16:26:58 -07:00
Yichi Chiang
636ce4a131
Replace leader when find a better one
2017-09-29 16:34:55 -07:00
Alex Miller
11668bb359
Fixing code review comments.
2017-09-29 15:58:36 -07:00
Alex Miller
b7ce9d996c
Comment out verbose TraceEvents in preparation for pushing.
2017-09-29 15:58:36 -07:00
Alex Miller
c40c1bb5fe
Add a new workload: BackupToDBAbort, which does an ACI switchover.
...
This is to allow easier testing of non-durable switchovers without having to
wiggle into BackupToDBCorrectness's view of the world.
2017-09-29 15:58:36 -07:00
Alex Miller
9e9a96ae76
Make VersionStamp workload able to run with DR-style workloads.
...
* It is now tolerant of locked database errors, and handles them correctly.
* There is an option to specify which database to verify against.
2017-09-29 15:58:36 -07:00
Alex Miller
34630b6130
Make VersionStamp workload able to handle commit_unknown_result.
...
Previously, if a transaction failed with commit_unknown_result, and was
actually committed, it would look like data that magically appeared in the
database and verification would fail.
Now, we explicitly re-read and check to see if the commit happened, so that we
may maintain an accurate understanding of what the database state should be.
2017-09-29 15:58:36 -07:00
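The re-read approach described in 34630b6130 can be illustrated with a small in-memory model. Everything here is hypothetical (`CommitUnknownResult`, `FlakyStore`, `commit_and_track`); it is a sketch of the idea in Python, not the workload's code, and it assumes the written value is unique enough that reading it back identifies this particular commit.

```python
class CommitUnknownResult(Exception):
    """Hypothetical stand-in for the commit_unknown_result error."""


class FlakyStore:
    """Toy store whose commit always reports an unknown result."""

    def __init__(self, applies: bool) -> None:
        self.data = {}
        self.applies = applies  # whether the write really lands

    def commit(self, key: bytes, value: bytes) -> None:
        if self.applies:
            self.data[key] = value
        raise CommitUnknownResult()

    def read(self, key: bytes):
        return self.data.get(key)


def commit_and_track(db, expected: dict, key: bytes, value: bytes) -> None:
    """Write key=value and keep `expected` in sync with what really landed."""
    try:
        db.commit(key, value)
        committed = True
    except CommitUnknownResult:
        # The outcome is ambiguous, so re-read instead of guessing.
        committed = db.read(key) == value
    if committed:
        expected[key] = value
```

Verification then compares the database against `expected`, which stays accurate whether or not the ambiguous commit was applied.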
Alex Miller
23945b9fea
VersionStamp can co-exist with other workloads that write data to the database.
...
VersionStamp previously would range-read the entire database during validation.
This has the unfortunate effect of making it fail during validation if run with
any other workload that writes keys to the database.
Now, all keys written and read are done with a configurable prefix, so that it
may co-exist with a variety of other workloads.
2017-09-29 15:58:36 -07:00
Alex Miller
370a6afb80
Make VersionStamp have an option to be tolerant of data being lost.
2017-09-29 15:58:36 -07:00
Alex Miller
8f4c45418b
Make atomicSwitchover preserve an ever-increasing commit version.
2017-09-29 15:58:36 -07:00
Alex Miller
69523ce151
Hackish version of a test, but it does fail.
2017-09-29 15:58:36 -07:00
Alex Miller
65713b226f
Fix whitespace and line endings.
2017-09-29 15:58:36 -07:00
Evan Tschannen
e2b65e86ed
added configurable memory limits for backup and dr executables
...
added a default memory limit of 8GB for fdbcli
2017-09-29 10:35:40 -07:00
Bhaskar Muppana
91975244fe
Fixing OSX build.
2017-09-28 19:35:44 -07:00
Bhaskar Muppana
942c04e992
Merge pull request #162 from bmuppana/master
...
Fixing TimeKeeperCorrectness to deal with network delays.
2017-09-28 17:04:39 -07:00
Bhaskar Muppana
3d2bafc3a6
Fixing TimeKeeperCorrectness to deal with network delays.
2017-09-28 16:52:28 -07:00
Evan Tschannen
ef41b07bb3
renamed past_version to transaction_too_old
...
implemented read_lock_aware option
2017-09-28 16:35:08 -07:00
Yichi Chiang
d4f75630de
Support log group field in status json
2017-09-28 16:31:29 -07:00
Evan Tschannen
7b60e26660
Merge pull request #160 from cie/use-error-descriptions
...
Add the ability to access name and description in Error. Update error…
2017-09-28 16:00:39 -07:00
Evan Tschannen
5f4b997400
emergency teams are bad for performance, because we will route client read requests to servers that do not have the data, therefore getting many wrong shard server errors. emergency teams only protect us from data loss in very rare scenarios. we may want to add them in again in the future, but must make sure load balance knows which storage servers used to be destinations so it can route to them only as a last resort.
2017-09-28 13:20:01 -07:00
Evan Tschannen
73fca75239
added the ability to disable timeKeeper; disabled timeKeeper before consistency check in simulation
2017-09-28 13:13:24 -07:00
A.J. Beamon
d30c730f75
Add the ability to access name and description in Error. Update error descriptions.
2017-09-28 12:35:03 -07:00
Bhaskar Muppana
0f8ff26029
Merge pull request #158 from bmuppana/master
...
<rdar://problem/34557380> Need a way to map real time to version
2017-09-27 17:56:42 -07:00
Bhaskar Muppana
6a0b1d6808
Fixing PR comments
...
<rdar://problem/34557380> Need a way to map real time to version
2017-09-27 17:56:01 -07:00
Evan Tschannen
4b21da1cd6
fix: lastVersionWithData was not updated when fetchKeys injects mutations
2017-09-27 10:44:34 -07:00
Evan Tschannen
acb7e66d01
fix: failed logs do not count even if they have returned a result
2017-09-25 18:14:40 -07:00
Evan Tschannen
2bf042a559
fix: file_corrupt was not checking for fault injection
...
latency threshold was too long
2017-09-25 17:22:41 -07:00
Bhaskar Muppana
0bf5bdb23a
<rdar://problem/34557380> Need a way to map real time to version
2017-09-25 12:51:37 -07:00
Yichi Chiang
6758c649fc
Catch and update processClass change from DBSource
2017-09-25 10:36:03 -07:00
Evan Tschannen
cce4eeb52d
fix: the master was sending the cluster controller uninitialized configurations
2017-09-22 16:59:24 -07:00
Evan Tschannen
180438d41e
fix: use the number of present logServers rather than the total size of the vector
2017-09-22 16:19:16 -07:00
Evan Tschannen
738ae21c3a
fix: an optimization in buggified locking can cause recovery to break because it would not restart if a locked process was killed when the remaining logs cannot obtain a quorum
2017-09-22 15:07:57 -07:00
Alex Miller
585c9bf68f
Quick fix to reduce CPU usage of ensureEpochLive.
...
It is suspected that policy recomputations are driving proxy CPU usage up, and
thus latency and throughput down. To quickly confirm this theory, we're
forcing ensureEpochLive to wait until it has RF responses, which means we'll
probably only validate the policy once per call.
2017-09-21 18:22:24 -07:00
Evan Tschannen
fbd67ea547
fix: excluded servers are worst fit for master rather than never assign (so that we can recover if every process has been excluded)
...
fix: better master exists did not use exclusions because the configuration was reset
2017-09-20 11:48:26 -07:00
Evan Tschannen
cb43563b2d
fix: toMap properly lists the redundancy mode of the cluster
2017-09-19 16:35:42 -07:00
Evan Tschannen
f75dfc3153
do not register with the master until recovery of the queue is complete, to avoid having the master wait a long time for a peek response
2017-09-18 17:39:12 -07:00
Alex Miller
567d663afd
Fix SimulationConfig never generating a custom config.
...
A 0 was changed to a 1 when rewriting code, and `case 0:` was never being hit. :(
Thankfully, it looks like nothing was broken by this in the meantime.
2017-09-18 17:29:36 -07:00
Evan Tschannen
e8b895c878
added the ability to disable connection failures for a period of time after one happens
2017-09-18 12:46:29 -07:00
Evan Tschannen
489332533c
all timeouts longer than two minutes can be lowered to 60.0 with buggification
...
added a workload that tries for a 50 second maximum latency in the presence of one failure with both buggification and connection failures
2017-09-18 11:04:51 -07:00
Evan Tschannen
34f987f56d
added a test in simulation which ensures that a recovery after a single failure takes less than 15 seconds
2017-09-15 17:55:01 -07:00
Evan Tschannen
d9b64899c5
fix: we need to wait for log server failures if we have not locked all of the logs
2017-09-15 13:11:21 -07:00
Evan Tschannen
36c98f18e9
do not register a worker with the cluster controller until it has finished recovering all files from disk
2017-09-15 10:57:58 -07:00
Evan Tschannen
f3b7aa615d
fix: seed storage servers are recruited based on the storage policy
2017-09-14 17:06:00 -07:00
Alvin Moore
9404d226d0
Merge branch 'release-5.0'
2017-09-13 16:49:00 -07:00
Alvin Moore
cb92194772
Fixed problem with master being recruited on excluded servers
2017-09-13 16:48:27 -07:00
Alex Miller
5e14f19875
Merge pull request #147 from cie/alexmiller/grvtlogs
...
Only verify a quorum of TLogs are unlocked for a GRV request
2017-09-13 16:07:25 -07:00
Alex Miller
d6b3be98fe
Fix whitespace.
2017-09-13 15:49:39 -07:00
Alex Miller
06a9c7a772
Remove unnecessary policy recomputations in confirmEpochLive.
...
Watching for interface changes on readied servers was done as a workaround for
a case where all futures could be ready, but the policy verification would
never succeed. This turns out to be because stopping a tlog causes an error to
be returned. However, if a TLog is stopped, then we know that we can't do any
more commits, so we can just immediately stop trying and never mark our future
as ready.
2017-09-13 15:45:09 -07:00
Evan Tschannen
8cb53fd608
Merge pull request #149 from cie/choose-leader-on-stateless-processes
...
choose leader on the preferred process class
2017-09-13 13:58:49 -07:00
Evan Tschannen
aea7a78cff
cluster controller changes were not maintained during merge
2017-09-11 17:40:46 -07:00
Evan Tschannen
d343d37274
fixed merge problems
2017-09-11 16:37:10 -07:00
Evan Tschannen
76e7988663
Merge branch 'master' into feature-remote-logs
...
# Conflicts:
# fdbserver/ClusterController.actor.cpp
# fdbserver/DataDistribution.actor.cpp
# fdbserver/OldTLogServer.actor.cpp
# fdbserver/TLogServer.actor.cpp
# fdbserver/WorkerInterface.h
# flow/Net2.actor.cpp
2017-09-11 15:15:56 -07:00
A.J. Beamon
4fa2415553
Merge branch 'release-5.0'
2017-09-08 17:28:12 -07:00
A.J. Beamon
bb8a245bdb
circus: throughput test scales latency error by the target latency
2017-09-08 17:27:54 -07:00
Evan Tschannen
ea26bc1c43
passed first tests which kill entire datacenters
...
added configuration options for the remote data center and satellite data centers
updated cluster controller recruitment logic
refactors how master writes core state
updated log recovery, and log system peeking
2017-09-07 15:32:08 -07:00
Yichi Chiang
bd1c7e7295
Use addTeamsBestOf() instead of addAllTeams() when team size is greater than 3
2017-09-07 12:31:01 -07:00
Bhaskar Muppana
c7df951f7c
Using BackupConfig from backup.actor.cpp to reduce intermediate
...
functions.
2017-09-07 08:36:36 -07:00
Bhaskar Muppana
fe208d6adf
Merge branch 'master' of github.com:apple/foundationdb into backup
2017-09-06 10:01:55 -07:00
Bhaskar Muppana
83810edabc
Backup/Restore tag can be std::string instead of Key.
2017-09-05 11:38:40 -07:00
Evan Tschannen
dc1f7ca6b7
testers now use client locality load balancing
2017-09-01 12:53:01 -07:00
A.J. Beamon
cc24072a5d
Add the multi version API to the list of APIs to choose in the APICorrectness tester. Support for the multi-version client already existed.
2017-08-31 16:23:55 -07:00
Evan Tschannen
d61be4c760
Merge branch 'release-5.0'
2017-08-30 12:59:24 -07:00
Evan Tschannen
963e1c3f31
fix: we need to reboot the process even if it will result in too many files, because the check will not succeed without it
2017-08-30 12:58:46 -07:00
Alex Miller
8d97a15c3f
BUGGIFY recovery to lock only the minimum number of TLogs required to prevent a quorum.
...
This is to test the quorum logic introduced in the previous patch, and should
flush out any other bugs that rely on TLog locking during recovery.
2017-08-29 14:43:40 -07:00
Alex Miller
f8486d1368
Only ensure a quorum of TLogs are unlocked to confirm the epoch hasn't ended.
...
Currently, GRV will wait to hear back from (almost) all TLogs to confirm that
they're unlocked and that the current epoch hasn't ended. This confirms that
there isn't a new set of proxies and using the commit version from the old set
of proxies would violate causal consistency.
However, during recovery, we ensure that no quorum of TLogs exists before
starting a new epoch and allowing new commits on the new TLogs. Thus, we only
need to wait until we have a quorum of TLogs that are unlocked.
This should be a significant improvement in latency particularly for the cases
when we start running >10 TLogs.
2017-08-29 14:43:40 -07:00
Alex Miller
4c1d61cd08
Assorted minor changes.
...
In which we:
* Clarify some math in a comment
* Remove misleading debugging information
* Add a useful trace event
2017-08-29 14:43:40 -07:00
Alex Miller
dbfa94f735
LF -> CRLF
...
It appears a previous patch left parts of this file ending with LF, and the
majority of the file ends in CRLF. I see no reason to keep this inconsistency,
but these line ending wars are going to drive me insane.
2017-08-29 14:43:40 -07:00
Alvin Moore
6020d70863
Added trace event to track reboots initiated by ConsistencyCheck workload in simulation
2017-08-29 11:41:27 -07:00
Alvin Moore
c95a1be5ec
Add trace event for rebooting process during simulation for consistency check
2017-08-29 11:00:44 -07:00
A.J. Beamon
86774f6e42
Merge branch 'release-5.0'
2017-08-28 17:17:00 -07:00
A.J. Beamon
03478561b9
fix: Set lock aware at the transaction level for latency probe to avoid having to fill the shard cache every time.
2017-08-28 17:16:46 -07:00
A.J. Beamon
9a0a3b6329
Merge commit '66528becb82d826e81fa644bb378212584ab580e'
2017-08-28 16:47:59 -07:00
Yichi Chiang
9fe927127f
choose leader on the preferred process class
2017-08-28 14:41:04 -07:00
Alvin Moore
44e0df78c5
Added support for tracking roles for simulation workers
...
Fixed the exclusion and inclusion address simulation API and integration within workloads
Added more information within trace events for simulation
2017-08-28 11:25:37 -07:00
Alvin Moore
581bd6c8ed
Added option to delay the displaying of the simulation workers
2017-08-28 10:53:56 -07:00
Alec Grieser
300b5a17ed
Merge branch 'release-5.0'
2017-08-25 18:55:33 -07:00
Evan Tschannen
272b4b984c
fix: fixed a rare bug where we do not wait for a file in the process of being deleted to shutdown before rebooting a machine
2017-08-25 10:12:58 -07:00
Evan Tschannen
26a5b5e422
rollback workload now clogs the communication between one of the proxies and the tlogs, since that is what will cause a rollback
2017-08-23 16:08:13 -07:00
A.J. Beamon
4c706d33e9
Merge branch 'release-5.0'
2017-08-23 14:59:43 -07:00
Evan Tschannen
be941b4bd1
sending void to committed could cause self to be deleted, so call cleanup before sending
2017-08-23 13:56:18 -07:00
Alvin Moore
7729f663e9
Ensured that the circus id is always lowercase
2017-08-23 13:45:00 -07:00
Evan Tschannen
f9308b8fa6
Merge pull request #145 from cie/alexmiller/simrefactor
...
Refactor simulation to pull all configuration parameters into one struct.
2017-08-23 12:54:21 -07:00
Evan Tschannen
4b40f817f1
fix: if recovery is cancelled before the copy is complete, remove the tlog
2017-08-23 12:26:03 -07:00
Alvin Moore
8056b78414
Merge branch 'release-5.0'
2017-08-22 13:51:19 -07:00
Alvin Moore
814e471689
Added support for displaying initial workers via printf within simulation using a workload
2017-08-22 13:38:24 -07:00
Alex Miller
7b78035365
Have SimulationConfig wrap DatabaseConfiguration to reduce code duplication.
...
This effectively turns initializing SimulationConfig into the equivalent of
building a config string and calling buildConfiguration on it.
2017-08-22 10:13:57 -07:00
Alex Miller
9b25c72971
Pull database config and cluster config into one struct.
...
This will allow us to specify custom situations to be chosen more frequently,
and in particular control machines and processes.
2017-08-21 22:35:44 -07:00
Alec Grieser
5ee07b1a9e
Merge branch 'release-5.0'
2017-08-14 16:56:58 -07:00
Evan Tschannen
de1b590a8a
The TLog did not delete data from removed logs
...
The TLog continued to make data from removed logs persistent
2017-08-11 18:08:09 -07:00
Stephen Atherton
50fb44be92
Merge branch 'release-5.0'
...
# Conflicts:
# versions.target
2017-08-09 23:36:12 -07:00
Evan Tschannen
2335fc73f2
fix: peek cursors were being timed out every 10 minutes, instead of 10 minutes after the last use
...
fix: if an interface is changed while we are not waiting in getMore, we will not reset the sequence to 0.
2017-08-09 15:58:06 -07:00
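The first fix in 2335fc73f2 is the difference between a fixed-interval timeout and an idle timeout. A minimal Python sketch of the idle variant follows, with hypothetical names (`IdleExpiry`, `touch`, `expired`) and an injectable clock so it can be exercised without waiting; it is not the peek cursor code itself.

```python
import time
from typing import Callable


class IdleExpiry:
    """Expire only after `idle_limit` seconds have passed since the last use."""

    def __init__(self, idle_limit: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.idle_limit = idle_limit
        self.clock = clock
        self.last_used = clock()

    def touch(self) -> None:
        # Every use pushes the deadline forward.
        self.last_used = self.clock()

    def expired(self) -> bool:
        return self.clock() - self.last_used >= self.idle_limit
```

A fixed-interval timer would omit `touch()`, so a cursor in active use would still be dropped once the interval elapsed.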
Evan Tschannen
47a37f3f1e
Merge pull request #135 from cie/switch-for-data-distribution
...
Add a switch to turn off data distribution in CLI
2017-08-07 12:54:08 -07:00
Evan Tschannen
c22708b6d6
added tag localities
...
fix: remote logs need to stop the master when they are stopped
2017-08-03 16:16:36 -07:00
Alec Grieser
ca7437ecf6
Merge branch 'release-5.0'
2017-08-02 22:07:01 -07:00
John King
d0fbc41338
set LOCK_AWARE on several transactions used for getting cluster info for the consistency check
2017-07-28 18:50:32 -07:00
Yichi Chiang
6a8a5c41b0
Add a switch to turn off data distribution in CLI
2017-07-28 18:14:55 -07:00
A.J. Beamon
4243486f54
fix: TLogMetrics was being track latested with the wrong ID
2017-07-28 14:37:23 -07:00
Yichi Chiang
37e5e2acbb
Fix parentheses issue in StorageMetrics.actor.h
2017-07-27 12:03:36 -07:00
Yichi Chiang
cdc62e265c
Merge pull request #133 from cie/shard-system-keyspace
...
Shard system keyspace
2017-07-26 17:09:13 -07:00
A.J. Beamon
41c90bcdea
Merge commit '89ac94853c70d08289e7fb58055bc5d0cd4e494d'
2017-07-26 15:35:36 -07:00
A.J. Beamon
d8e308c18f
Enable use of incremental delete when deleting disk queue and sqlite KVS files.
2017-07-26 14:11:11 -07:00
Yichi Chiang
53e1ae9f60
shard system keyspace
2017-07-26 13:47:31 -07:00
Stephen Atherton
4aaee86c2a
Moved MetricLogger actor to fdbclient so applications other than fdbserver can use it.
2017-07-24 13:13:06 -07:00
Evan Tschannen
2ae445782e
fix: cannot rely on the bestServer’s version because other logs may have higher versions
2017-07-21 19:21:49 -07:00
Evan Tschannen
f6826f1e15
fix: log routers were popped at too high of a version
...
fix: make sure tlogs make everything durable
fix: make cluster controller’s temporary remote log recruitment not interfere with better master exists
2017-07-20 16:26:05 -07:00
Evan Tschannen
7fec378830
do not continue copying data from prior generations after being locked
2017-07-19 15:11:18 -07:00
Evan Tschannen
5852a6301b
fixed even more bugs
2017-07-15 15:15:03 -07:00
Alec Grieser
c860f09d8a
Merge branch 'release-5.0'
2017-07-14 16:01:15 -07:00
Alec Grieser
660729839c
moved Notified.h from flow -> fdbclient ; flow bindings package does better job when excluding testers
2017-07-14 15:49:30 -07:00