Commit Graph

103 Commits

Author SHA1 Message Date
Stephen Atherton 54fc81b260 Improved backup error reporting in backup status. The most recent error for each error type is reported along with how long ago the error occurred, and errors are divided into two categories based on whether or not they occurred since the most recent backup progress. 2018-02-16 19:38:31 -08:00
Evan Tschannen b78e0a362a fix: do not pause when running multiple backup tests simultaneously 2018-01-18 12:24:33 -08:00
Stephen Atherton 93b34a945f Major usability and performance improvements to backup management. Backup descriptions now calculate and display timestamps using TimeKeeper data (if given a cluster) and restorability of snapshots. Expire now requires a --force option to leave a backup unrestorable or unrestorable after a given point in time, specified by version or timestamp. BackupContainerFilesystem now maintains metadata on key version boundaries in order to avoid large list operations for describe and expire operations. Blob parallel recursive list operations can now take a path (aka prefix) filter function. New describe and expire options are available in fdbbackup. 2018-01-17 04:09:43 -08:00
Stephen Atherton b86f68ceb8 Added new test that combines atomic backup/restore. Added randomization to delays in AtomicRestore workload. 2018-01-05 14:43:21 -08:00
Evan Tschannen f8f1c48d83 sometimes test pausing backups 2018-01-04 11:40:08 -08:00
Evan Tschannen 86958cb08d Merge pull request #226 from cie/fix-taskBucket-unblockFuture
Modify TaskBucketCorrectness to support chain and multiple tasks
2017-12-20 18:00:54 -08:00
Yichi Chiang 91e5abeaa6 Modify TaskBucketCorrectness to support chain and multiple tasks 2017-12-20 17:02:49 -08:00
Evan Tschannen 982f0dcb1e Merge pull request #222 from cie/alexmiller/drtimefix2
Fix yet another VersionStamp DR issue.
2017-12-20 15:09:23 -08:00
Stephen Atherton e0d9cea008 Merge branch 'master' into continuous-backup
# Conflicts:
#	fdbclient/FileBackupAgent.actor.cpp
#	fdbrpc/BlobStore.actor.cpp
2017-12-19 23:02:14 -08:00
Alex Miller c7dbd31a1e Refactoring: Create a common prefixRange and do UID->Key once in backup. 2017-12-19 17:17:50 -08:00
Stephen Atherton e28641886d TraceEvent improvements. Minor bug fix, restore log writing tasks didn't have the log file endVersion but it's only for logging purposes. 2017-12-19 15:27:04 -08:00
Evan Tschannen 1dc9eceb6d optimize GetKeyLocationRequests on the proxy so they only require a single map lookup, instead of doing 3 + (3* [number of ranges]) lookups 2017-12-15 20:13:44 -08:00
Stephen Atherton 33f9f1a95c Added SnapshotDispatch task for writing snapshots in random order over a specified period of time and adapting speed to a growing or shrinking database. TaskBucket now supports scheduling tasks. TaskFuture now correctly recognizes multiple tasks in its callback space. TaskBucket extendTimeout() now supports specifying the new timeout version. Submitting a backup now requires a snapshot duration. 2017-12-14 01:44:38 -08:00
Evan Tschannen 7ce93426ed fix: connection disabler in removeServerSafely needs to run for the whole test to avoid getting stuck on include all 2017-12-12 18:38:57 -08:00
Alec Grieser 4495a19299 Merge pull request #220 from cie/alexmiller/flowprofcircus
Add class restrictions to CpuProfiler, and fix metric crash.
2017-12-11 14:13:22 -08:00
Evan Tschannen 73a0a07eac clients ask for key location information directly from the proxy, instead of reading it from the database 2017-12-09 16:10:22 -08:00
Alex Miller 48660e9ce5 Add class restrictions to CpuProfiler, and fix metric crash.
This change largely refactors away the old meaning of the value given to
flow_profiler, which was the number of machines that we'd be profiling, and
instead replaces it with the classes of processes to profile for the duration
of the test.  Most importantly, this means that one can profile in circus with
a configuration that has "ssd" in it, and the circus run will still complete
(as long as the argument isn't "storage").

And also finally add some other fixes I had to the same file to conditionally
change the name of the metric we're looking for to comply with what's actually
written.
2017-12-07 19:28:29 -08:00
Stephen Atherton f8e89a40ac Bug fixes, take(1) is incorrect usage of FlowLock. 2017-12-04 10:25:47 -08:00
Yichi Chiang 8ba0eaebff Check cluster controller using desired process class in consistency check 2017-11-29 15:09:23 -08:00
Stephen Atherton 6695c9e6a2 Bug fixes and improvements to error handling and trace events. The most serious bug was that restore would start at the wrong version, possibly skipping early log and range files. 2017-11-25 00:46:16 -08:00
Stephen Atherton a77162b53d Merge branch 'master' into backup-container-refactor
# Conflicts:
#	fdbclient/BackupAgent.h
#	fdbclient/FileBackupAgent.actor.cpp
#	fdbclient/KeyBackedTypes.h
2017-11-15 08:14:47 -08:00
Stephen Atherton 3dfaf13b67 IBackupContainer has been rewritten to be a logical interface for storing, reading, deleting, expiring, and querying backup data. The details of how the data is organized or stored is now hidden from users of the interface. Both the local and blobstore containers have been rewritten, the key changes being a multi level directory structure and no more use of temporary files or pseudo-symlinks in the blob store implementation. This refactor has a large impact radius as the previous backup container was just a thin wrapper that presented a single level list of files and offered no methods for managing or interpreting the file structure so all of that logic was spread around other places in the code base. This made moving to the new blob store schema very messy, and without this refactor further changes in the future would only be worse.
Several backup tasks have been cleaned up / simplified because they no longer need to manage the ‘raw’ structure of the backup.  The addition of IBackupFile and its finish() method simplified the log and range writer tasks.  Updated BlobStoreEndpoint to support now-required bucket creation and bucket listing prefix/delimiter options for finding common prefixes.  Added KeyBackedSet<T> type.  Moved JSONDoc to its own header.  Added platform::findFilesRecursively().

Still to do:  update command line tool to use new IBackupContainer interface, fix bugs in Restore startup.
2017-11-14 23:33:17 -08:00
Alex Miller 311d1ca87d A variety of fixes that collectively fix using flow profiling in circus.
To run, use --co=flow_profiling=-1, because reasons.
2017-11-07 13:55:16 -08:00
Evan Tschannen 57aba0b3bc fix: excluded servers were the same fitness as storage servers for the master role
fix: better master exists did not considers exclusion for master fitness
2017-11-03 17:09:14 -07:00
Evan Tschannen 54d82c0d92 Merge pull request #194 from cie/alexmiller/valgrind
Fix valgrind errors
2017-10-27 17:25:12 -07:00
Alex Miller e0d33ef8d7 Preemptively fix profiler-related valgrind errors/straight out bugs.
I forgot to initialize some fields in requests.
2017-10-27 17:20:19 -07:00
Balachandar Namasivayam cfefab18fb Merge branch 'master' into add-new-atomic-ops 2017-10-25 18:03:34 -07:00
Balachandar Namasivayam 3d5658940a Addressed Review Comments 2017-10-25 16:42:05 -07:00
Balachandar Namasivayam 9dd588dcce Addressed review comments.
Changed naming for NewMin and NewAnd to MinV2 and AndV2
2017-10-25 14:48:05 -07:00
Balachandar Namasivayam 2f6d55a52f Add correctness tests for all atomic ops 2017-10-25 13:36:49 -07:00
Yichi Chiang c2a117fe07 Merge pull request #189 from cie/enable-check-desired-class
Enable checkUsingDesiredClasses() in consistency check
2017-10-24 15:18:21 -07:00
Yichi Chiang defdc6550d Exclude excluded processses when getting testers 2017-10-24 15:16:34 -07:00
Yichi Chiang 3865c5ae0e Enable checkUsingDesiredClasses() in consistency check 2017-10-24 12:58:54 -07:00
Balachandar Namasivayam 8c3bdc5b3b Make atomic ops differentiate between unset and empty values. 2017-10-23 16:48:13 -07:00
Bhaskar Muppana 360b777b78 Fail with correct error code in case of abort or discontinue of
non-existing backups.
2017-10-18 23:17:48 -07:00
Alec Grieser dd6d8f3b0e Merge branch 'master' into add-new-atomic-ops 2017-10-18 16:36:44 -07:00
Bhaskar Muppana 314511f4d7 Fixing spaces in BackupCorrectness TraceEvents. 2017-10-18 14:27:52 -07:00
Alex Miller 7b9bc1d715 Merge pull request #170 from cie/alexmiller/flowprofile
Add support for profiling a running fdb cluster to fdbcli, fix security issues, and add an improved backtrace.
2017-10-16 16:51:53 -07:00
Alex Miller 91a26a170c Add toggleable profiling support to fdbserver+fdbcli.
This adds the fdbcli commands:
* profile list -- Lists all workers in a way that doesn't fill `kill`'s list.
* profile flow run -- Allows starting flow profiling on a set of hosts for a specified interval.

And threads through all the support for enabling and disabling profiling as an RPC.
2017-10-16 16:05:02 -07:00
Bhaskar Muppana d1e9d28239 Backup log messages. 2017-10-12 16:12:42 -07:00
Stephen Atherton 11517f7bfc Merge branch 'master' into continuous-backup
# Conflicts:
#	fdbclient/FileBackupAgent.actor.cpp
2017-10-12 11:03:23 -07:00
Balachandar Namasivayam 8e0bea2795 Update API_VERSION from 500 to 510 2017-10-11 13:49:38 -07:00
Evan Tschannen 8feb3b8fbc fixed conflict range workload by just disabling timeKeeper instead of the check, because it should be a more robust fix 2017-10-10 16:01:02 -07:00
Balachandar Namasivayam eeebf10030 Modified existing behavior of MIN and AND atomic ops. The new behavior results in a 'SET' if the atomic op is performed on a non -existing key.
Added new atomic ops ByteMin and ByteMax that does lexicographic comparison of byte strings.
2017-10-10 13:02:22 -07:00
Evan Tschannen c8525dc3e7 timekeeper is constantly changing keys in the system keyspace, so do not report errors on key mismatches on keys in the system keyspace 2017-10-10 12:04:56 -07:00
Alex Miller a21c8a820b Move cpuProfilerRequest from WorkerInterface to ClientWorkerInterface.
A way to access this stream is required if we wish to be able to toggle
profiling from fdbcli.  There's two ways to do this:

1. Use `monitorLeader()` to get a `ClusterControllerFullInterface`, and use
`getWorkers` from there to get a list of `WorkerInterface`s, from which we can
access cpuProfilerRequest.
2. Move cpuProfilerRequest to ClientWorkerInterface and use the existing code
in the client that can fetch a list of all `ClientWorkerInterface`s.

The split between WorkerInterface and ClientWorkerInterface appears to be
what a client might have a need to call versus what is fdbserver-internal (and
thus no client should even want to call). Thus, it seems to make more sense to
acknowledge that profiling is useful to be able to toggle from a client, and go
with option (2).
2017-10-05 14:08:28 -07:00
Alex Miller 11668bb359 Fixing code review comments. 2017-09-29 15:58:36 -07:00
Alex Miller b7ce9d996c Comment out verbose TraceEvents in preparation for pushing. 2017-09-29 15:58:36 -07:00
Alex Miller c40c1bb5fe Add a new workload: BackupToDBAbort, which does an ACI switchover.
This is to allower easier testing of non-durable switchovers without having to
wiggle into BackupToDBCorrectness's view of the world.
2017-09-29 15:58:36 -07:00
Alex Miller 9e9a96ae76 Make VersionStamp workload able to run with DR-style workloads.
* It is now tolerant of locked database errors, and handles them correctly.
* There is an option to specify which database to verify against.
2017-09-29 15:58:36 -07:00