foundationdb

Commit Graph

Author	SHA1	Message	Date
A.J. Beamon	2744646090	Merge branch 'release-5.0' into release-5.1	2018-01-22 11:57:58 -08:00
A.J. Beamon	188562ccbc	fix: Status should create its DatabaseConfiguration using fromKeyValues(). This makes sure that various state is correctly set if not specified in the configuration.	2018-01-22 11:40:08 -08:00
A.J. Beamon	35b91bfb55	Add back (in different form) some ratekeeper trace events when a storage server or log doesn't respond. Add actualTPS (named TPSBasis) to RkUpdate.	2018-01-18 14:51:38 -08:00
Evan Tschannen	b78e0a362a	fix: do not pause when running multiple backup tests simultaneously	2018-01-18 12:24:33 -08:00
Stephen Atherton	93b34a945f	Major usability and performance improvements to backup management. Backup descriptions now calculate and display timestamps using TimeKeeper data (if given a cluster) and restorability of snapshots. Expire now requires a --force option to leave a backup unrestorable or unrestorable after a given point in time, specified by version or timestamp. BackupContainerFilesystem now maintains metadata on key version boundaries in order to avoid large list operations for describe and expire operations. Blob parallel recursive list operations can now take a path (aka prefix) filter function. New describe and expire options are available in fdbbackup.	2018-01-17 04:09:43 -08:00
Evan Tschannen	645dc5ead6	warmRange needs to get a read version occasionally to prevent it from overwhelming the proxy; quietDatabase waits for all data distribution to be completely finished so that databases are cached in a cleaner state	2018-01-14 12:50:52 -08:00
Evan Tschannen	be643d6937	fix: the tlog did not cancel recovery properly when stopped	2018-01-12 17:18:14 -08:00
Evan Tschannen	3915d6825c	we need to check the server list at a higher priority, because if we do not notice a storage server interface change for a long period of time, we will mark it as failed	2018-01-12 12:51:07 -08:00
Evan Tschannen	de119f192d	fixed a priority inversion where the tlog would prefer to copy data from the previous generation rather than make data durable (leading to being ratekeeper controlled)	2018-01-11 16:09:49 -08:00
Evan Tschannen	29ebb19388	Merge branch 'release-5.0' into release-5.1	2018-01-11 15:43:37 -08:00
Evan Tschannen	22e5a0b257	formatting	2018-01-11 14:44:09 -08:00
Evan Tschannen	173a8de3ed	DBCoreState supports upgrades from 3.0 versions	2018-01-11 14:39:51 -08:00
A.J. Beamon	2f5073d00f	Some visual studio project cleanup.	2018-01-10 10:07:18 -08:00
Evan Tschannen	022df3b91b	backup and restore sometimes took too long in simulation	2018-01-09 17:26:42 -08:00
Evan Tschannen	645f68212b	make timekeeper priority system immediate	2018-01-08 18:21:00 -08:00
Evan Tschannen	370e8a9903	fix: split metrics could fail an assert in a very rare scenario	2018-01-08 18:20:22 -08:00
Stephen Atherton	b86f68ceb8	Added new test that combines atomic backup/restore. Added randomization to delays in AtomicRestore workload.	2018-01-05 14:43:21 -08:00
A.J. Beamon	5015119115	Generalize the message that gets displayed in status if a cluster file's contents are incorrect.	2018-01-05 10:29:47 -08:00
Evan Tschannen	e11f461cbd	fix: better master exists needs to check master fitness before tlogs or proxies because that is the order of recruitment	2018-01-04 15:19:46 -08:00
Evan Tschannen	f8f1c48d83	sometimes test pausing backups	2018-01-04 11:40:08 -08:00
Evan Tschannen	f2c4beed9f	fix: tlogFitness did not consider it better to have one tlog of a better fitness; fix: checkStable was not used in all places in better master exists; fix: we need to call checkOutstanding on worker registration in all cases; fix: in case persistentData is keyValueStoreMemory, we need to make sure it is fully recovered before writing to it	2018-01-04 11:33:02 -08:00
Evan Tschannen	6d5dd9bd27	fix: we cannot pipeline disk queue commits until after the first commit is successful	2018-01-02 13:30:27 -08:00
Evan Tschannen	86958cb08d	Merge pull request #226 from cie/fix-taskBucket-unblockFuture Modify TaskBucketCorrectness to support chain and multiple tasks	2017-12-20 18:00:54 -08:00
Yichi Chiang	91e5abeaa6	Modify TaskBucketCorrectness to support chain and multiple tasks	2017-12-20 17:02:49 -08:00
Alex Miller	f70e3b9fe8	Add or change a bunch of comments to provide descriptions of function contracts. This cleans up a bit of the VersionStamp DR work I did, and leaves hints and advice for anyone who will be touching mutation applying code in the future.	2017-12-20 16:57:14 -08:00
Evan Tschannen	982f0dcb1e	Merge pull request #222 from cie/alexmiller/drtimefix2 Fix yet another VersionStamp DR issue.	2017-12-20 15:09:23 -08:00
Alex Miller	b5a6bc0ab7	Fix VersionStamp problems by instead adding a COMMIT_ON_FIRST_PROXY transaction option. Simulation identified the fact that we can violate the VersionStamps-are-always-increasing promise via the following series of events: 1. On proxy 0, dumpData adds commit requests to proxy 0's commit promise stream 2. To any proxy, a client submits the first transaction of abortBackup, which stops further dumpData calls on proxy 0. 3. To any proxy that is not proxy 0, submit a transaction that checks if it needs to upgrade the destination version. 4. The transaction from (3) is committed 5. Transactions from (1) are committed This is possible because the dumpData transactions have no read conflict ranges, and thus it's impossible to make them abort due to "conflicting" transactions. There's also no promise that if client C sends a commit to proxy A, and later a client D sends a commit to proxy B, that B must log its commit after A. (We only promise that if C is told it was committed before D is told it was committed, then A committed before B.) There was a failed attempt to fix this problem. We tried to add read conflict ranges to dumpData transactions so that they could be aborted by "conflicting" transactions. However, this failed because this now means that dumpData transactions require conflict resolution, and the stale read version that they use can cause them to be aborted with a transaction_too_old error. (Transactions that don't have read conflict ranges will never return transaction_too_old, because with no reads, the read snapshot version is effectively meaningless.) This was never previously possible, so the existing code doesn't retry commits, and to make things more complicated, the dumpData commits must be applied in order. This would require either adding dependencies to transactions (if A is going to commit then B must also be/have committed), which would be complicated, or submitting transactions with a fixed read version, and replaying the failed commits with a higher read version once we get a transaction_too_old error, which would unacceptably slow down the maximum throughput of dumpData. Thus, we've instead elected to add a special transaction option that bypasses proxy load balancing for commits, and always commits against proxy 0. We can know for certain that after the transaction from (2) is committed, all of the dumpData transactions that will be committed have been added to the commit promise stream on proxy 0. Thus, if we enqueue another transaction against proxy 0, we can know that it will be placed into the promise stream after all of the dumpData transactions, thus providing the semantics that we require: no dumpData transaction can commit after the destination version upgrade transaction.	2017-12-20 15:04:04 -08:00
Stephen Atherton	e0d9cea008	Merge branch 'master' into continuous-backup # Conflicts: # fdbclient/FileBackupAgent.actor.cpp # fdbrpc/BlobStore.actor.cpp	2017-12-19 23:02:14 -08:00
Alex Miller	c7dbd31a1e	Refactoring: Create a common prefixRange and do UID->Key once in backup.	2017-12-19 17:17:50 -08:00
Alex Miller	1488c12c18	Simulation will return an error and print if any non-suppressed SevError events were logged. This means that loops like `seed=1; while ./fdbserver -r simulation -s $seed; do seed=$(($seed+1)); done` can be used to find an example of an often failing test. This also means joshua will report ExitCode errors on anything that has a SevError in the log. As a part of this, we also implicitly downgrade any injected errors to SevWarnAlways.	2017-12-19 17:17:50 -08:00
Stephen Atherton	e28641886d	TraceEvent improvements. Minor bug fix, restore log writing tasks didn't have the log file endVersion but it's only for logging purposes.	2017-12-19 15:27:04 -08:00
Evan Tschannen	a5601877b3	fix: valgrind issue with destruction ordering	2017-12-18 15:31:59 -08:00
Evan Tschannen	1dc9eceb6d	optimize GetKeyLocationRequests on the proxy so they only require a single map lookup, instead of doing 3 + (3* [number of ranges]) lookups	2017-12-15 20:13:44 -08:00
Stephen Atherton	33f9f1a95c	Added SnapshotDispatch task for writing snapshots in random order over a specified period of time and adapting speed to a growing or shrinking database. TaskBucket now supports scheduling tasks. TaskFuture now correctly recognizes multiple tasks in its callback space. TaskBucket extendTimeout() now supports specifying the new timeout version. Submitting a backup now requires a snapshot duration.	2017-12-14 01:44:38 -08:00
Evan Tschannen	7ce93426ed	fix: connection disabler in removeServerSafely needs to run for the whole test to avoid getting stuck on include all	2017-12-12 18:38:57 -08:00
Alec Grieser	4495a19299	Merge pull request #220 from cie/alexmiller/flowprofcircus Add class restrictions to CpuProfiler, and fix metric crash.	2017-12-11 14:13:22 -08:00
Evan Tschannen	73a0a07eac	clients ask for key location information directly from the proxy, instead of reading it from the database	2017-12-09 16:10:22 -08:00
Alex Miller	48660e9ce5	Add class restrictions to CpuProfiler, and fix metric crash. This change largely refactors away the old meaning of the value given to flow_profiler, which was the number of machines that we'd be profiling, and instead replaces it with the classes of processes to profile for the duration of the test. Most importantly, this means that one can profile in circus with a configuration that has "ssd" in it, and the circus run will still complete (as long as the argument isn't "storage"). And also finally add some other fixes I had to the same file to conditionally change the name of the metric we're looking for to comply with what's actually written.	2017-12-07 19:28:29 -08:00
Stephen Atherton	abb2dd1ebc	Merge pull request #214 from cie/alexmiller/fallocate Use fallocate to zero ranges instead of writing zeroes	2017-12-06 13:47:40 -08:00
Evan Tschannen	5a947212ed	fix: ensure all prior commits have completed before returning that a commit has committed from the disk queue	2017-12-06 12:31:07 -08:00
Stephen Atherton	f8e89a40ac	Bug fixes, take(1) is incorrect usage of FlowLock.	2017-12-04 10:25:47 -08:00
Evan Tschannen	49dac11a5f	added a SevWarnAlways for when a disk queue file grows larger than 20GB	2017-12-01 15:05:17 -08:00
Evan Tschannen	482ac38ca6	added knobs so that the client failure monitoring update rate and the server failure monitoring update rate are separate knobs	2017-12-01 13:04:32 -08:00
Evan Tschannen	c3918d892a	do not use bandwidth splitting on the keyServer shard, lots of sets and clears to this shard generally means you do not want to create additional data distribution work	2017-11-30 18:28:16 -08:00
Alex Miller	196258080b	Refactor zeroing a chunk of a file from DiskQueue into IAsyncFile. If we're going to do the work to provide more optimized ways to zero files, then I'd feel better with this being in a more common place, so that any other zero-ers are likely to reuse it. It also makes testing easier/more obvious. Also, because it's needed for correctness, fix the aligned_alloc for OSX, which wasn't aligned, and use an actually aligned allocation function.	2017-11-30 17:57:55 -08:00
Alex Miller	c7a120c59d	Rename IAsyncFile::incrementalDelete -> IAsyncFileSystem::incrementalDeleteFile. `deleteFile` existed in IAsyncFileSystem, so an incremental delete function seems to belong more as a virtual method on IAsyncFileSystem than a static method on IAsyncFile, and the naming should match. As long as we're here, change IAsyncFile to declare a virtual destructor, so that it has good and proper C++ behavior. I presume this is what was vaguely intended by the default constructor definition that previously existed?	2017-11-30 17:19:10 -08:00
Evan Tschannen	7f72aa7de5	fix: a storage server does not ever need to rollback before a version restored from disk	2017-11-30 11:19:43 -08:00
Evan Tschannen	e5a682948c	Merge pull request #212 from cie/check-cluster-controller-desired-class Check cluster controller using desired process class in consistency c…	2017-11-29 15:57:51 -08:00
Yichi Chiang	8ba0eaebff	Check cluster controller using desired process class in consistency check	2017-11-29 15:09:23 -08:00
Evan Tschannen	8c51bc4ac4	fixed low latency tests in a way that gives us better test coverage	2017-11-28 18:20:29 -08:00

266 Commits