Commit Graph

688 Commits

Author SHA1 Message Date
Stephen Atherton aeebe711ce TaskBucket’s saveAndExtend() is now accomplished through extendTimeout() with an option to save parameters. SaveAndExtendIncrementally() has been removed as it is no longer needed because TaskBucket’s normal execution loop calls extendTimeout() periodically as long as the TaskFunc’s execute() actor has not finished or thrown. If a TaskFunc wants to save changes to task parameters to checkpoint progress for task restarts to benefit from it can call extendTimeout() explicitly with the updateParams flag set to true. 2017-11-30 17:18:57 -08:00
Stephen Atherton 1e643239f9 Improvement in blob connnection reuse, oldest connnections in pool are now used first. 2017-11-30 12:57:29 -08:00
Stephen Atherton 39edda1804 Bug fix, and some code cleanup along the way. If a range backup task dies in finish() the re-run of the task will start at begin == end, which wasn’t being handled correctly. 2017-11-27 15:57:19 -08:00
Stephen Atherton d9c2f6d705 Bug fix. The terminator argument of readCommitted() previously did nothing, and end_of_stream() was always sent to the output stream. The parameter was fixed to enable changing this behavior but original the behavior was not being correctly preserved in at least one case. 2017-11-26 22:52:47 -08:00
Stephen Atherton 9ce9fd8692 Added comments to describe IBackupFile contract. 2017-11-26 22:02:14 -08:00
Stephen Atherton 1d3af8f4f0 Bug fix. 2017-11-25 21:13:56 -08:00
Stephen Atherton 1b1c8e985a Merge branch 'master' into backup-container-refactor
# Conflicts:
#	fdbclient/FileBackupAgent.actor.cpp
2017-11-25 19:54:51 -08:00
Stephen Atherton 6695c9e6a2 Bug fixes and improvements to error handling and trace events. The most serious bug was that restore would start at the wrong version, possibly skipping early log and range files. 2017-11-25 00:46:16 -08:00
Stephen Atherton 3449bc4cdc Bug fix, range end was wrong for final range file of backup range task. 2017-11-19 04:44:33 -08:00
Stephen Atherton a31216f3f7 Added toString() to Backup/Restore TaskFunc interface so tasks can provide a method to describe important task parameters for the default handleError() methods to use. 2017-11-19 04:39:18 -08:00
Stephen Atherton 32903ffa77 Trace event improvements and severity changes. 2017-11-19 04:34:28 -08:00
Stephen Atherton 9354a8cbb4 Added new backup container method to list everything in a backup. 2017-11-19 04:28:22 -08:00
Evan Tschannen f9efdf1fc1 fix: typeString was not static, so it added a lot of memory to MutationRef 2017-11-17 23:36:09 -08:00
Alex Miller f19cb3bbbd Merge pull request #208 from cie/alexmiller/grvtfix
Fix the GRV performance regression
2017-11-17 15:00:44 -08:00
Bhaskar Muppana 1bf84cd51a Merge pull request #210 from bmuppana/backup-logs
Adding TraceEvents for BackupRangeTask.
2017-11-16 19:12:04 -08:00
Bhaskar Muppana 5e596ea670 Adding TraceEvents for BackupRangeTask. 2017-11-16 19:11:31 -08:00
Yichi Chiang d9a98aa968 Remove commented code 2017-11-16 17:25:37 -08:00
Evan Tschannen 7211a397b0 Merge pull request #209 from cie/fix-double-recoveries
Fix double recoveries
2017-11-16 17:16:39 -08:00
Yichi Chiang 0d5dc15ac8 Fix double recoveries 2017-11-16 16:58:55 -08:00
Stephen Atherton 07c19098fe Improved backup container unit test, added file reading / verification, more data, and a series of expirations and validating the expected result. Then fixed the bugs that this new testing discovered. 2017-11-16 16:19:56 -08:00
Alex Miller e9412bbb11 Fix the GRV performance regression introduced by adding the policy engine to GRV calculations.
Construction of LocalityGroup from LocalityData is expensive, and the previous
code greatly ran afoul of that.  The policy engine does a large amount of
interning of strings and building compressed maps to make the expected many
future selectReplica calls cheap.  Unfortunately we don't call selectReplicas,
so much of this work is undesireable for us, and a large amount of CPU time is
spent doing this initialization work.

The new changes aggressively do the minimal LocalityGroup::add() calls
necessary, and make them as cheap as possibly by removing all elements from
LocalityData that don't need to be considered by the policy.

This optimization was also applied to the PeekCursor used during recovery,
which should speed recoveries up by a small amount.
2017-11-16 16:15:52 -08:00
Stephen Atherton f105204aca Shifted version distribution over folders. 2017-11-15 23:13:04 -08:00
Stephen Atherton cc47d0e161 Bug fix in restore dispatch, begin file was not being incremented. Removed try/catch because the inherited handleError() is better. 2017-11-15 22:38:31 -08:00
Alex Miller e900333dbf Fix a subtle valgrind error.
If there was an error in waiting for the read version, we would attempt to
serialize and eventually commit a CommitTransactionRef that had an
uninitialized read_snapshot.
2017-11-15 19:21:20 -08:00
Evan Tschannen ad456a939a Merge pull request #206 from cie/change-excluded-cluster-controller
Change excluded cluster controller
2017-11-15 17:28:33 -08:00
Yichi Chiang f96faf72d9 Add fullyRecoveredConfig for checking exclusions 2017-11-15 17:15:24 -08:00
Stephen Atherton ab0017f023 TaskBucket’s TaskFunc interface now has an optional handleError() which is called on any task that throws an error from execute() or finish(). Restore and Backup tasks use this to ensure that any errors that occur are placed in the backup or restore config’s lastError property. Bug fixes in log and range file encodings. 2017-11-15 13:33:09 -08:00
A.J. Beamon 02a9978612 Merge pull request #204 from cie/fdbcli-warn-incompatible-peers
Added a counter to keep track of active outgoing incompatible connect…
2017-11-15 12:52:01 -08:00
Evan Tschannen 30464e943c Merge pull request #205 from cie/cleanup-spammy-traceevents
Cleanup spammy traceevents
2017-11-15 12:41:37 -08:00
Evan Tschannen e113dba0e3 added a new trace event tracking master recovery durations 2017-11-15 12:38:26 -08:00
Stephen Atherton a77162b53d Merge branch 'master' into backup-container-refactor
# Conflicts:
#	fdbclient/BackupAgent.h
#	fdbclient/FileBackupAgent.actor.cpp
#	fdbclient/KeyBackedTypes.h
2017-11-15 08:14:47 -08:00
Stephen Atherton e07dcb9ada Fixed header paths. 2017-11-15 00:05:20 -08:00
Stephen Atherton 3dfaf13b67 IBackupContainer has been rewritten to be a logical interface for storing, reading, deleting, expiring, and querying backup data. The details of how the data is organized or stored is now hidden from users of the interface. Both the local and blobstore containers have been rewritten, the key changes being a multi level directory structure and no more use of temporary files or pseudo-symlinks in the blob store implementation. This refactor has a large impact radius as the previous backup container was just a thin wrapper that presented a single level list of files and offered no methods for managing or interpreting the file structure so all of that logic was spread around other places in the code base. This made moving to the new blob store schema very messy, and without this refactor further changes in the future would only be worse.
Several backup tasks have been cleaned up / simplified because they no longer need to manage the ‘raw’ structure of the backup.  The addition of IBackupFile and its finish() method simplified the log and range writer tasks.  Updated BlobStoreEndpoint to support now-required bucket creation and bucket listing prefix/delimiter options for finding common prefixes.  Added KeyBackedSet<T> type.  Moved JSONDoc to its own header.  Added platform::findFilesRecursively().

Still to do:  update command line tool to use new IBackupContainer interface, fix bugs in Restore startup.
2017-11-14 23:33:17 -08:00
Balachandar Namasivayam 987379d790 Changed naming of num_incompatible_connections to numIncompatibleConnections 2017-11-14 18:37:29 -08:00
Yichi Chiang df922bc973 Change excluded cluster controller 2017-11-14 13:57:37 -08:00
Balachandar Namasivayam 986b73f458 Fixed an issue where an ACTOR outlives an object passed to it and then crashes while accessing it. 2017-11-14 13:51:23 -08:00
A.J. Beamon bb1297c686 Remove RkServerQueueInfo and RkTLogQueueInfo trace events, since this information is more or less already logged on the storage servers and tlogs. Update the quiet database check and magnesium to use the information from the logs and storage servers. 2017-11-14 12:59:42 -08:00
A.J. Beamon 3b952efb4e Remove events from cluster controller that get logged for roughly every worker upon recovery, master registration, etc. 2017-11-14 10:15:45 -08:00
Balachandar Namasivayam 27b67cffbe The earlier implementation of tracking number of incompatible connection had a bug where the counter will be incorrectly decremented for incoming connections on certain conditions.
Now the counter increment and decrement happens in the same ACTOR (ConnecitonReader) and makes it easy to verify its correctness.
2017-11-13 15:07:39 -08:00
A.J. Beamon 313e823629 Delete TDMetric data (tmpEventMetric) when a trace event is throttled. 2017-11-13 15:06:21 -08:00
A.J. Beamon 0fea5e9c2f Convert client_invalid_operation errors to ASSERTs. 2017-11-13 11:38:34 -08:00
A.J. Beamon cd085764f1 Do not automatically change a cluster file that does not match what you expect. 2017-11-10 14:12:45 -08:00
Balachandar Namasivayam 486d089e98 Better message is displayed to the user. 2017-11-09 13:55:19 -08:00
Balachandar Namasivayam 9809e84806 Added a counter to keep track of active outgoing incompatible connections.
This counter is used to print a warning in fdbcli if there are incompatible peers.

Example Output:

./fdbcli
Using cluster file `fdb.cluster'.

WARNING: Incompatible peers exist.

The database is unavailable; type `status' for more information.

Welcome to the fdbcli. For help, type `help'.
fdb> status

WARNING: Incompatible peers exist.

Using cluster file `fdb.cluster'.

Could not communicate with a quorum of coordination servers:
  127.0.0.1:4000  (unreachable)
2017-11-09 11:20:35 -08:00
Alex Miller 311d1ca87d A variety of fixes that collectively fix using flow profiling in circus.
To run, use --co=flow_profiling=-1, because reasons.
2017-11-07 13:55:16 -08:00
Alec Grieser f53afd615e Merge pull request #203 from cie/update-old-kernel-memavailable-computation
Changes to MemAvailable computation on kernels without MemAvailable
2017-11-06 18:30:31 -08:00
A.J. Beamon d174e05bac Merge pull request #180 from cie/bindings-versionstamps-in-tuples
<rdar://problem/25560444> [Feature] Versionstamped keys and tuple/directory incompatibility
2017-11-06 16:39:17 -08:00
A.J. Beamon fee6734e71 Add braces around multiline if block 2017-11-06 16:38:32 -08:00
Alec Grieser 396434794d some python versionstamp api tweaks 2017-11-06 14:56:41 -08:00
Stephen Atherton 7a0aa7b0aa Merge pull request #202 from cie/fdbcli-status-log-backup-dr-info
Fdbcli status log backup dr info
2017-11-06 14:55:20 -08:00