Commit Graph

831 Commits

Author SHA1 Message Date
Alex Miller f19cb3bbbd Merge pull request #208 from cie/alexmiller/grvtfix
Fix the GRV performance regression
2017-11-17 15:00:44 -08:00
A.J. Beamon 3ded271153 Dispose of Cluster objects in fdb.open() 2017-11-17 12:21:14 -08:00
Alec Grieser f657be8136 add a space to match the bracing style used in this file 2017-11-17 09:55:11 -08:00
A.J. Beamon 0981e0dcdd Dispose of newly created transactions if transfer() fails. 2017-11-17 09:47:17 -08:00
Bhaskar Muppana 1bf84cd51a Merge pull request #210 from bmuppana/backup-logs
Adding TraceEvents for BackupRangeTask.
2017-11-16 19:12:04 -08:00
Bhaskar Muppana 5e596ea670 Adding TraceEvents for BackupRangeTask. 2017-11-16 19:11:31 -08:00
Yichi Chiang d9a98aa968 Remove commented code 2017-11-16 17:25:37 -08:00
Evan Tschannen 7211a397b0 Merge pull request #209 from cie/fix-double-recoveries
Fix double recoveries
2017-11-16 17:16:39 -08:00
Yichi Chiang 0d5dc15ac8 Fix double recoveries 2017-11-16 16:58:55 -08:00
Stephen Atherton 07c19098fe Improved backup container unit test, added file reading / verification, more data, and a series of expirations and validating the expected result. Then fixed the bugs that this new testing discovered. 2017-11-16 16:19:56 -08:00
Alex Miller e9412bbb11 Fix the GRV performance regression introduced by adding the policy engine to GRV calculations.
Construction of LocalityGroup from LocalityData is expensive, and the previous
code greatly ran afoul of that.  The policy engine does a large amount of
interning of strings and building compressed maps to make the expected many
future selectReplica calls cheap.  Unfortunately we don't call selectReplicas,
so much of this work is undesireable for us, and a large amount of CPU time is
spent doing this initialization work.

The new changes aggressively do the minimal LocalityGroup::add() calls
necessary, and make them as cheap as possibly by removing all elements from
LocalityData that don't need to be considered by the policy.

This optimization was also applied to the PeekCursor used during recovery,
which should speed recoveries up by a small amount.
2017-11-16 16:15:52 -08:00
A.J. Beamon 5b5850e097 The dispose in Database.createTransaction was supposed to happen on error, not in the finally block 2017-11-16 10:50:13 -08:00
Stephen Atherton f105204aca Shifted version distribution over folders. 2017-11-15 23:13:04 -08:00
Stephen Atherton cc47d0e161 Bug fix in restore dispatch, begin file was not being incremented. Removed try/catch because the inherited handleError() is better. 2017-11-15 22:38:31 -08:00
Alex Miller e900333dbf Fix a subtle valgrind error.
If there was an error in waiting for the read version, we would attempt to
serialize and eventually commit a CommitTransactionRef that had an
uninitialized read_snapshot.
2017-11-15 19:21:20 -08:00
Evan Tschannen ad456a939a Merge pull request #206 from cie/change-excluded-cluster-controller
Change excluded cluster controller
2017-11-15 17:28:33 -08:00
Yichi Chiang f96faf72d9 Add fullyRecoveredConfig for checking exclusions 2017-11-15 17:15:24 -08:00
A.J. Beamon db017317ac Update the Java bindings to call add missing dispose calls. 2017-11-15 15:56:50 -08:00
Stephen Atherton ab0017f023 TaskBucket’s TaskFunc interface now has an optional handleError() which is called on any task that throws an error from execute() or finish(). Restore and Backup tasks use this to ensure that any errors that occur are placed in the backup or restore config’s lastError property. Bug fixes in log and range file encodings. 2017-11-15 13:33:09 -08:00
A.J. Beamon 02a9978612 Merge pull request #204 from cie/fdbcli-warn-incompatible-peers
Added a counter to keep track of active outgoing incompatible connect…
2017-11-15 12:52:01 -08:00
Evan Tschannen 30464e943c Merge pull request #205 from cie/cleanup-spammy-traceevents
Cleanup spammy traceevents
2017-11-15 12:41:37 -08:00
Evan Tschannen e113dba0e3 added a new trace event tracking master recovery durations 2017-11-15 12:38:26 -08:00
Stephen Atherton a77162b53d Merge branch 'master' into backup-container-refactor
# Conflicts:
#	fdbclient/BackupAgent.h
#	fdbclient/FileBackupAgent.actor.cpp
#	fdbclient/KeyBackedTypes.h
2017-11-15 08:14:47 -08:00
Stephen Atherton e07dcb9ada Fixed header paths. 2017-11-15 00:05:20 -08:00
Stephen Atherton 3dfaf13b67 IBackupContainer has been rewritten to be a logical interface for storing, reading, deleting, expiring, and querying backup data. The details of how the data is organized or stored is now hidden from users of the interface. Both the local and blobstore containers have been rewritten, the key changes being a multi level directory structure and no more use of temporary files or pseudo-symlinks in the blob store implementation. This refactor has a large impact radius as the previous backup container was just a thin wrapper that presented a single level list of files and offered no methods for managing or interpreting the file structure so all of that logic was spread around other places in the code base. This made moving to the new blob store schema very messy, and without this refactor further changes in the future would only be worse.
Several backup tasks have been cleaned up / simplified because they no longer need to manage the ‘raw’ structure of the backup.  The addition of IBackupFile and its finish() method simplified the log and range writer tasks.  Updated BlobStoreEndpoint to support now-required bucket creation and bucket listing prefix/delimiter options for finding common prefixes.  Added KeyBackedSet<T> type.  Moved JSONDoc to its own header.  Added platform::findFilesRecursively().

Still to do:  update command line tool to use new IBackupContainer interface, fix bugs in Restore startup.
2017-11-14 23:33:17 -08:00
Balachandar Namasivayam 987379d790 Changed naming of num_incompatible_connections to numIncompatibleConnections 2017-11-14 18:37:29 -08:00
Yichi Chiang df922bc973 Change excluded cluster controller 2017-11-14 13:57:37 -08:00
Balachandar Namasivayam 986b73f458 Fixed an issue where an ACTOR outlives an object passed to it and then crashes while accessing it. 2017-11-14 13:51:23 -08:00
A.J. Beamon bb1297c686 Remove RkServerQueueInfo and RkTLogQueueInfo trace events, since this information is more or less already logged on the storage servers and tlogs. Update the quiet database check and magnesium to use the information from the logs and storage servers. 2017-11-14 12:59:42 -08:00
A.J. Beamon 3b952efb4e Remove events from cluster controller that get logged for roughly every worker upon recovery, master registration, etc. 2017-11-14 10:15:45 -08:00
Balachandar Namasivayam 27b67cffbe The earlier implementation of tracking number of incompatible connection had a bug where the counter will be incorrectly decremented for incoming connections on certain conditions.
Now the counter increment and decrement happens in the same ACTOR (ConnecitonReader) and makes it easy to verify its correctness.
2017-11-13 15:07:39 -08:00
A.J. Beamon 313e823629 Delete TDMetric data (tmpEventMetric) when a trace event is throttled. 2017-11-13 15:06:21 -08:00
A.J. Beamon 0fea5e9c2f Convert client_invalid_operation errors to ASSERTs. 2017-11-13 11:38:34 -08:00
A.J. Beamon cd085764f1 Do not automatically change a cluster file that does not match what you expect. 2017-11-10 14:12:45 -08:00
Balachandar Namasivayam 486d089e98 Better message is displayed to the user. 2017-11-09 13:55:19 -08:00
Balachandar Namasivayam 9809e84806 Added a counter to keep track of active outgoing incompatible connections.
This counter is used to print a warning in fdbcli if there are incompatible peers.

Example Output:

./fdbcli
Using cluster file `fdb.cluster'.

WARNING: Incompatible peers exist.

The database is unavailable; type `status' for more information.

Welcome to the fdbcli. For help, type `help'.
fdb> status

WARNING: Incompatible peers exist.

Using cluster file `fdb.cluster'.

Could not communicate with a quorum of coordination servers:
  127.0.0.1:4000  (unreachable)
2017-11-09 11:20:35 -08:00
Alex Miller 311d1ca87d A variety of fixes that collectively fix using flow profiling in circus.
To run, use --co=flow_profiling=-1, because reasons.
2017-11-07 13:55:16 -08:00
Alec Grieser f53afd615e Merge pull request #203 from cie/update-old-kernel-memavailable-computation
Changes to MemAvailable computation on kernels without MemAvailable
2017-11-06 18:30:31 -08:00
A.J. Beamon d174e05bac Merge pull request #180 from cie/bindings-versionstamps-in-tuples
<rdar://problem/25560444> [Feature] Versionstamped keys and tuple/directory incompatibility
2017-11-06 16:39:17 -08:00
A.J. Beamon fee6734e71 Add braces around multiline if block 2017-11-06 16:38:32 -08:00
Alec Grieser 396434794d some python versionstamp api tweaks 2017-11-06 14:56:41 -08:00
Stephen Atherton 7a0aa7b0aa Merge pull request #202 from cie/fdbcli-status-log-backup-dr-info
Fdbcli status log backup dr info
2017-11-06 14:55:20 -08:00
A.J. Beamon fefbf22b6f Remove redundant tagCount 2017-11-06 14:00:08 -08:00
A.J. Beamon b1fe3d1b55 Delete spurious spaces 2017-11-06 12:59:00 -08:00
A.J. Beamon bf07fa3023 Untested changes to MemAvailable computation on kernels without MemAvailable 2017-11-06 09:35:05 -08:00
A.J. Beamon 7d63c2cfb7 Formatting fixes 2017-11-06 09:20:31 -08:00
A.J. Beamon 839ddd4d2e Merge pull request #201 from cie/bindings-java-get-range-first-chunk
<rdar://problem/35325444> Java Bindings: getRange shouldn't fetch first chunk twice when using asList
2017-11-06 09:09:52 -08:00
Evan Tschannen 150aca4734 Merge pull request #199 from cie/getrange-perf-improvements
Getrange perf improvements
2017-11-04 14:25:55 -07:00
Evan Tschannen 33eef4c76b Merge branch 'master' into getrange-perf-improvements 2017-11-04 14:25:45 -07:00
Evan Tschannen 706bf1e018 fix: we cannot trigger better master exists before a master is fully recovered because exclusions changed by the provisional master will not be committed until the master is fully recovered 2017-11-04 12:48:04 -07:00