Commit Graph

189 Commits

Author SHA1 Message Date
Evan Tschannen 21482a45e1 Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbserver/DBCoreState.h
#	fdbserver/LogSystem.h
#	fdbserver/LogSystemPeekCursor.actor.cpp
#	fdbserver/TLogServer.actor.cpp
2018-01-14 13:40:24 -08:00
Alvin Moore 2e6ce03224 Merge pull request #232 from cie/build-dont-compile-hpp
Filter out .hpp files from *_BUILD_SOURCES (like we do with .h files)…
2018-01-12 14:09:25 -08:00
Evan Tschannen 02bd83ff76 changed incompatibleDataRead to an asyncTrigger 2018-01-11 13:35:56 -08:00
A.J. Beamon 80b84c23ac Filter out .hpp files from *_BUILD_SOURCES (like we do with .h files). Add xml2json.hpp to our fdbrpc project. 2018-01-10 13:51:57 -08:00
A.J. Beamon ce93d98b50 Temporarily remove xml2json.hpp from fdbrpc vcxproj 2018-01-10 10:18:44 -08:00
A.J. Beamon 2f5073d00f Some visual studio project cleanup. 2018-01-10 10:07:18 -08:00
Stephen Atherton 0e7d538c94 Bug fix, in recursive blob folder listings the recent removal of common prefixes from the result stream caused the list marker to not be set correctly when a folder level requires multiple requests due to folder size. 2018-01-06 20:58:48 -08:00
Evan Tschannen 3ec45d38a0 Merge branch 'master' into feature-remote-logs
# Conflicts:
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2018-01-06 13:54:45 -08:00
Stephen Atherton 96cb06cbc7 Bug fixes. Fdbbackup delete was broken. Blobstore backup container deletion would do too much listing before deletions began due to list operations queueing up ahead of and starving the delete operations. Created new knob and blob endpoint limit for concurrent list operations to fix this. Increased blob request timeout default because some requests were taking longer. Crash fixes in blobstore doRequest() which wasn't checking that response object is valid before using it in error conditions. Filesystem-like backup container class (covering blobstore and local dirs) now ignores unrecognized filenames for describe() and expire() operations. 2018-01-05 23:06:39 -08:00
Evan Tschannen 5ac4f73978 Merge branch 'release-5.1' into feature-remote-logs
# Conflicts:
#	fdbclient/NativeAPI.actor.cpp
#	fdbrpc/Locality.h
#	fdbrpc/simulator.h
#	fdbserver/ApplyMetadataMutation.h
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/LogSystemPeekCursor.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/SimulatedCluster.actor.cpp
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/TagPartitionedLogSystem.actor.cpp
#	fdbserver/WorkerInterface.h
#	fdbserver/masterserver.actor.cpp
#	flow/Net2.actor.cpp
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2018-01-05 11:33:42 -08:00
Stephen Atherton 78430425e8 Blob bucket listings will now use parallel recursive requests on CommonPrefixes, up to a max depth, if a delimiter is provided. 2018-01-02 23:17:52 -08:00
Stephen Atherton 07fde9dfb4 Bug fix, error code 429 was not being treated as retryable in the recent refactor. 2018-01-02 23:15:25 -08:00
Stephen Atherton f324afc13f Bug fix in blob store listing when it requires multiple serial requests Added more trace events to FileBackup and BlobStoreEndpoint with suppression and added suppression to existing trace events. 2017-12-22 17:08:25 -08:00
Stephen Atherton f2524ffd33 AsyncFileBlobStoreWrite was prohibiting the writing of 0-byte files. Improved HTTP verbose logging to stdout. Added writing a 0-byte file to BackupContainer unit test. Added backup log and snapshot sizes to backup description. 2017-12-21 21:15:26 -08:00
Stephen Atherton e0ef5a9a20 Whitespace normalization. 2017-12-21 12:07:29 -08:00
Stephen Atherton e3aee45a74 Backup tools and agent now accept blob account credentials via files containing JSON which are specified using command line arguments and/or an environment variable. Improved fdbbackup help, clarifying which options are for which operations. Fdbbackup operations which do not need to use a database no longer require a cluster file parameter. Added eat() commands to StringRef for incrementally tokenizing strings using separator strings. 2017-12-21 01:58:15 -08:00
Stephen Atherton e0d9cea008 Merge branch 'master' into continuous-backup
# Conflicts:
#	fdbclient/FileBackupAgent.actor.cpp
#	fdbrpc/BlobStore.actor.cpp
2017-12-19 23:02:14 -08:00
Alex Miller 9a0df6d76d Deallocate aligned_alloc with aligned_free.
This probably fixes a windows-only crash, as only windows cares about this distinction.
2017-12-14 15:12:05 -08:00
Stephen Atherton b6cfe010a1 Bug fix in URL encoding of delimiter. 2017-12-12 17:31:19 -08:00
Stephen Atherton 872edd7540 Merge branch 'release-5.0'
# Conflicts:
#	fdbclient/FileBackupAgent.actor.cpp
2017-12-06 16:27:04 -08:00
Stephen Atherton 41f80bf7ed Renamed an error, changed blob request failure to Warn severity. 2017-12-06 15:58:54 -08:00
Stephen Atherton 4bc7d0b86a Updated error names and severities. 2017-12-06 15:42:44 -08:00
Stephen Atherton abb2dd1ebc Merge pull request #214 from cie/alexmiller/fallocate
Use fallocate to zero ranges instead of writing zeroes
2017-12-06 13:47:40 -08:00
Alex Miller 064670a95b Maintain a reference to the IAsyncFile in zeroRange.
And also add some notes about the reference semantics to the IAsyncFile header
for future readers.
2017-12-06 13:41:21 -08:00
Balachandar Namasivayam 1f949240f5 Make fdbbackup s3 compatible.
s3 sends response in XML.  FDB backup expects json response. Added a new libraray xml2json to convert xml to json.
2017-12-05 17:13:15 -08:00
Stephen Atherton 86ae6c09c7 Bug fixes, take(1) is incorrect usage of FlowLock. 2017-12-04 10:20:50 -08:00
Evan Tschannen 482ac38ca6 added knobs so that the client failure monitoring update rate and the server failure monitoring update rate are separate knobs 2017-12-01 13:04:32 -08:00
Alex Miller 7bab3a4ece AsyncFileKAIO will prefer using fallocate's ZERO_RANGE for AsyncFile::zero().
For situations in which we have support for FALLOC_FL_ZERO_RANGE, it's much
faster to use fallocate than manually overwrite the file with zero bytes.  Note
that this support depends on having a kernel from late 2014 or newer, and being
on ext4 or xfs.  If these conditions aren't met, we'll fall back to writing
zeros in 1MB chunks as normal.
2017-11-30 17:57:55 -08:00
Alex Miller 196258080b Refactor zeroing a chunk of a file from DiskQueue into IAsyncFile.
If we're going to do the work to provide more optimized ways to zero files,
then I'd feel better with this being in a more common place, so that any other
zero-ers are likely to reuse it.  It also makes testing easier/more obvious.

Also, because it's needed for correctness, fix the aligned_alloc for OSX, which
wasn't aligned, and use an actually aligned allocation function.
2017-11-30 17:57:55 -08:00
Alex Miller c7a120c59d Rename IAsyncFile::incrementalDelete -> IAsyncFileSystem::incrementalDeleteFile.
`deleteFile` existed in IAsyncFileSystem, so an incremental delete function
seems to belong more as a virtual method on IAsyncFileSystem than a static
method on IAsyncFile, and the naming should match.

As long as we're here, change IAsyncFile to declare a virtual destructor, so
that it has good and proper C++ behavior.  I presume this is what was vaguely
intended by the default constructor definition that previously existed?
2017-11-30 17:19:10 -08:00
Stephen Atherton 1e643239f9 Improvement in blob connnection reuse, oldest connnections in pool are now used first. 2017-11-30 12:57:29 -08:00
Stephen Atherton 1b1c8e985a Merge branch 'master' into backup-container-refactor
# Conflicts:
#	fdbclient/FileBackupAgent.actor.cpp
2017-11-25 19:54:51 -08:00
Alex Miller f19cb3bbbd Merge pull request #208 from cie/alexmiller/grvtfix
Fix the GRV performance regression
2017-11-17 15:00:44 -08:00
Alex Miller e9412bbb11 Fix the GRV performance regression introduced by adding the policy engine to GRV calculations.
Construction of LocalityGroup from LocalityData is expensive, and the previous
code greatly ran afoul of that.  The policy engine does a large amount of
interning of strings and building compressed maps to make the expected many
future selectReplica calls cheap.  Unfortunately we don't call selectReplicas,
so much of this work is undesireable for us, and a large amount of CPU time is
spent doing this initialization work.

The new changes aggressively do the minimal LocalityGroup::add() calls
necessary, and make them as cheap as possibly by removing all elements from
LocalityData that don't need to be considered by the policy.

This optimization was also applied to the PeekCursor used during recovery,
which should speed recoveries up by a small amount.
2017-11-16 16:15:52 -08:00
Stephen Atherton a77162b53d Merge branch 'master' into backup-container-refactor
# Conflicts:
#	fdbclient/BackupAgent.h
#	fdbclient/FileBackupAgent.actor.cpp
#	fdbclient/KeyBackedTypes.h
2017-11-15 08:14:47 -08:00
Stephen Atherton e07dcb9ada Fixed header paths. 2017-11-15 00:05:20 -08:00
Stephen Atherton 3dfaf13b67 IBackupContainer has been rewritten to be a logical interface for storing, reading, deleting, expiring, and querying backup data. The details of how the data is organized or stored is now hidden from users of the interface. Both the local and blobstore containers have been rewritten, the key changes being a multi level directory structure and no more use of temporary files or pseudo-symlinks in the blob store implementation. This refactor has a large impact radius as the previous backup container was just a thin wrapper that presented a single level list of files and offered no methods for managing or interpreting the file structure so all of that logic was spread around other places in the code base. This made moving to the new blob store schema very messy, and without this refactor further changes in the future would only be worse.
Several backup tasks have been cleaned up / simplified because they no longer need to manage the ‘raw’ structure of the backup.  The addition of IBackupFile and its finish() method simplified the log and range writer tasks.  Updated BlobStoreEndpoint to support now-required bucket creation and bucket listing prefix/delimiter options for finding common prefixes.  Added KeyBackedSet<T> type.  Moved JSONDoc to its own header.  Added platform::findFilesRecursively().

Still to do:  update command line tool to use new IBackupContainer interface, fix bugs in Restore startup.
2017-11-14 23:33:17 -08:00
Balachandar Namasivayam 987379d790 Changed naming of num_incompatible_connections to numIncompatibleConnections 2017-11-14 18:37:29 -08:00
Balachandar Namasivayam 27b67cffbe The earlier implementation of tracking number of incompatible connection had a bug where the counter will be incorrectly decremented for incoming connections on certain conditions.
Now the counter increment and decrement happens in the same ACTOR (ConnecitonReader) and makes it easy to verify its correctness.
2017-11-13 15:07:39 -08:00
Balachandar Namasivayam 9809e84806 Added a counter to keep track of active outgoing incompatible connections.
This counter is used to print a warning in fdbcli if there are incompatible peers.

Example Output:

./fdbcli
Using cluster file `fdb.cluster'.

WARNING: Incompatible peers exist.

The database is unavailable; type `status' for more information.

Welcome to the fdbcli. For help, type `help'.
fdb> status

WARNING: Incompatible peers exist.

Using cluster file `fdb.cluster'.

Could not communicate with a quorum of coordination servers:
  127.0.0.1:4000  (unreachable)
2017-11-09 11:20:35 -08:00
Evan Tschannen 57aba0b3bc fix: excluded servers were the same fitness as storage servers for the master role
fix: better master exists did not considers exclusion for master fitness
2017-11-03 17:09:14 -07:00
John Brownlee d46e240de2 Merge branch 'release-5.0'
# Conflicts:
#	fdbclient/FileBackupAgent.actor.cpp
#	versions.target
2017-11-02 10:42:30 -07:00
Stephen Atherton f050105243 Added HTTP 502 to the list of retryable errors. 2017-11-01 11:41:32 -07:00
Alex Miller 3b61b76876 Fix a massive amount of valgrind errors and make them easier to debug in the future.
std::is_pod<> being less restrictive than is_binary_serializable<> meant that
structs that both were POD and had a serialize method defined would be binary
serialized instead of using the defined serialize().  This means that it would
also serialize any padding that the struct contained, which would cause mass
waves of valgrind failures from uninitialized memory.

Included in this change is additional uses of valgrind client requests so that
attempts to send uninitialized memory are reported at the sending site, versus
as part of checksum calculation in sending the packet.
2017-10-27 16:54:44 -07:00
Evan Tschannen df74e2a373 re-added support for non-copying tlog recovery 2017-10-24 15:09:31 -07:00
Stephen Atherton 45fa3680fa Restore logging of remote address (if connected) or host (if connection fails) for blob errors. 2017-10-20 21:47:23 -07:00
Stephen Atherton 3afc85881e Merge branch 'master' into backup-container-refactor
# Conflicts:
#	fdbrpc/BlobStore.actor.cpp
2017-10-20 21:38:28 -07:00
Stephen Atherton 42955012e9 Merge branch 'release-5.0'
# Conflicts:
#	fdbrpc/BlobStore.actor.cpp
#	flow/error_definitions.h
2017-10-20 21:16:55 -07:00
Stephen Atherton 9f151314b3 Changed some trace event severities. Also fixed a weird casing of “retryable”. 2017-10-19 17:47:42 -07:00
Evan Tschannen e2c1e87df6 made a large number of fixes to make fearless DR correctness clean. 2017-10-19 15:36:32 -07:00
Stephen Atherton caad691ae2 Added comments for how to handle HTTP 400 errors gracefully in certain instances should the need arise. 2017-10-18 23:47:59 -07:00
Stephen Atherton ef84e52127 Improved error handling and memory usage in AsyncFileBlobStoreWrite. Writes will now fail if any upload has already failed, rather than buffering unboundedly until sync() is called to complete the file. There is also a configurable limit on how many uploads can be pending before writes will stall waiting for one to finish. 2017-10-18 05:51:30 -07:00
Stephen Atherton ebd0234514 Rewrote most error handling in BlobStoreEndpoint to fix several shortcomings in error handling and logging. The request loop now logs but rate limits all errors, and the exceptions thrown are more appropriate. HTTP 503 is now treated as retryable. Callers of BlobStoreEndpoint::doRequest() now specify which codes they consider to be successful so that more error handling can take place in the main request loop. 2017-10-18 02:52:09 -07:00
Alex Miller 7b9bc1d715 Merge pull request #170 from cie/alexmiller/flowprofile
Add support for profiling a running fdb cluster to fdbcli, fix security issues, and add an improved backtrace.
2017-10-16 16:51:53 -07:00
Alex Miller cf646d4a99 Address review comments.
* Fixed fdbcli to be more idiomatic.
* Removed is_binary_serializable in favor of std::is_pod<>
* Removed custom enable_if<> in favor of std::enable_if<>
* Removed HEY REVIEWER comments
* Removed print from prof.py
* Added FLOW_PROFILER_ENABLED=yes to circus components that wished to enable the flow profiler.
2017-10-16 16:46:52 -07:00
Yichi Chiang a6ae89af1a Merge pull request #176 from cie/add-cluster-controller-process-class
Add cluster controller process class
2017-10-16 16:27:54 -07:00
Yichi Chiang af2aa41136 Downgrade Transaction process class for cluster controller 2017-10-16 16:27:01 -07:00
Yichi Chiang 76c5488421 Add cluster controller process class 2017-10-16 16:21:25 -07:00
Stephen Atherton e934604f67 Added DNS resolution. Interface is INetworkConnections::resolveTCPEndpoint() to resolve, or for convenience INetworkConnections::connect(host, service) will resolve host and service (port number or service name like http) and connect to one of the addresses at random.
BlobStoreEndpoint now only accepts hostnames and an optional service, so this update is not compatible with the previous URL formats having many IP addresses.
2017-10-15 21:51:11 -07:00
Evan Tschannen ff1b49be2e Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbserver/DatabaseConfiguration.cpp
2017-10-10 16:07:59 -07:00
Evan Tschannen 15962cf079 Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbrpc/Locality.cpp
#	fdbrpc/Locality.h
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/ClusterRecruitmentInterface.h
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/TagPartitionedLogSystem.actor.cpp
#	fdbserver/WorkerInterface.h
#	fdbserver/fdbserver.vcxproj.filters
#	fdbserver/masterserver.actor.cpp
#	fdbserver/worker.actor.cpp
#	flow/error_definitions.h
2017-10-05 17:09:44 -07:00
Alvin Moore de8f875038 Fixed call to IsClear
Changed killMachine and killDataCenter interface to return final killtype
Updated TESTs for DataCenter to ensure that DataCenter was killed
Added assertion to ensure that failed DC kills were not downgrades
2017-10-05 03:07:20 -07:00
Stephen Atherton fd5fe3a000 Add slightly better handling of HTTP 503 in blob client. Previously it would end the blob request loop and the task doing the blob action would see a failure, but now the blob request attempt loop will continue to back off and retry. This is better because previously the task that saw the failure would be re-run quickly. 2017-10-03 15:25:49 -07:00
Stephen Atherton 03c4cea511 Added rate-controlled TraceEvents for blob http connection attempts and failures. 2017-10-03 15:21:40 -07:00
Yichi Chiang 284e35204a Fix connection count 2017-10-03 10:54:20 -07:00
Alvin Moore 5257b99d3f Fixed problem with machines RebootedAndCleared not being considered dead in availability consideration 2017-10-03 10:48:16 -07:00
Alvin Moore d099656557 Merge branch 'release-5.0' 2017-10-02 12:05:24 -07:00
Alvin Moore 25513d8e2c Added tests for DataCenter kills 2017-10-02 12:04:28 -07:00
Evan Tschannen 6ea9903c82 Merge branch 'release-5.0'
# Conflicts:
#	fdbbackup/backup.actor.cpp
#	fdbserver/ClusterController.actor.cpp
#	versions.target
2017-10-01 18:46:44 -07:00
Stephen Atherton 058300be16 Each blobstore request will again select a random remote address. This used to happen before recent load balancing improvements related to focusing too much load on consistently up endpoints after others have recovered from being down. 2017-10-01 16:17:38 -07:00
Stephen Atherton a95107417f Improved behavior of slow writes during backup. KeyRange and Log backup tasks now use TaskBucket::saveAndExtend() to keep the task alive until flushing the file finishes or fails with an error (blob uploads fail after a limited number of retries). This prevents blob uploads from being retried too often if the destination is slow since a task abort and retry would start the backoff counters back at zero. Also removed a debugging behavior that was accidentally checked in. 2017-10-01 16:01:24 -07:00
Stephen Atherton a098919b20 Bug fix, releaser declared in wrong place, and lots of whitespace cleanup from try blocks that were no longer needed. 2017-10-01 11:25:50 -07:00
Stephen Atherton af87ac301d Removed wait never used for debugging which was accidentally included in bug fix. 2017-10-01 11:19:38 -07:00
Stephen Atherton 6000cafde1 Bug fix, locks were being taken inside try/catch so release would be done even if the take threw an error. Changed to using a Releaser. 2017-10-01 10:46:55 -07:00
Evan Tschannen f84e7252e8 fix: there was a reference counting cycle in asyncFileBlobStore and asyncFileReadAhead 2017-09-29 19:13:08 -07:00
A.J. Beamon 38616424f6 Report a couple error cases in blobstore URL parsing when dealing with numbers. 2017-09-29 17:58:49 -07:00
Alex Miller c40c1bb5fe Add a new workload: BackupToDBAbort, which does an ACI switchover.
This is to allower easier testing of non-durable switchovers without having to
wiggle into BackupToDBCorrectness's view of the world.
2017-09-29 15:58:36 -07:00
Evan Tschannen a1f8b546e6 fix: ensure connections to blob store are evenly distributed across network addresses
added a per address limit to the number of open connections
lowered a variety of knobs to prevent us from using too much memory
2017-09-29 14:59:24 -07:00
A.J. Beamon d30c730f75 Add the ability to access name and description in Error. Update error descriptions. 2017-09-28 12:35:03 -07:00
Alvin Moore 298b54104e Merge branch 'release-5.0' 2017-09-26 11:16:14 -07:00
Alvin Moore 02525d7b14 Added TESTs to ensure that all of the different kills are performed during simulation 2017-09-26 11:15:39 -07:00
Evan Tschannen e8b895c878 added the ability to disable connection failures for a period of time after one happens 2017-09-18 12:46:29 -07:00
Evan Tschannen 8cb53fd608 Merge pull request #149 from cie/choose-leader-on-stateless-processes
choose leader on the perferred process class
2017-09-13 13:58:49 -07:00
Alvin Moore b1dd2ac6fe Merge branch 'release-5.0' 2017-09-12 13:34:28 -07:00
Alvin Moore 4a6fb10a42 Added TraceEvents for remaining and killed workers when killing DataCenter
Fixed consideration of excluded workers when checking cluster availability
2017-09-12 13:33:13 -07:00
Evan Tschannen 76e7988663 Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/OldTLogServer.actor.cpp
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/WorkerInterface.h
#	flow/Net2.actor.cpp
2017-09-11 15:15:56 -07:00
Evan Tschannen ea26bc1c43 passed first tests which kill entire datacenters
added configuration options for the remote data center and satellite data centers
updated cluster controller recruitment logic
refactors how master writes core state
updated log recovery, and log system peeking
2017-09-07 15:32:08 -07:00
Evan Tschannen 6f6dbe4b33 fix: load balance will still use second requests when client locality is present 2017-09-01 11:14:18 -07:00
Alvin Moore 0994587573 Fixed OS X compilation build warnings due to printf field specifiers 2017-09-01 09:35:56 -07:00
Alvin Moore fd439e9d1c Fixed OS X compilation build warnings due to printf field type specifiers 2017-09-01 09:34:53 -07:00
Stephen Atherton 6e9de8f35a Bug fix. eraseDirectoryRecursive() on MacOS used to do nothing at all, but now it erases directories recursively. The Linux version was modified to be simpler and use a version of the FTW API that also works on MacOS. 2017-08-31 00:11:18 -07:00
A.J. Beamon 9a0a3b6329 Merge commit '66528becb82d826e81fa644bb378212584ab580e' 2017-08-28 16:47:59 -07:00
Yichi Chiang 9fe927127f choose leader on the perferred process class 2017-08-28 14:41:04 -07:00
Alvin Moore 44e0df78c5 Added support for tracking roles for simulation workers
Fixed the exclusion and inclusion address simulation API and integration within workloads
Added more information within trace events for simulation
2017-08-28 11:25:37 -07:00
Alec Grieser 300b5a17ed Merge branch 'release-5.0' 2017-08-25 18:55:33 -07:00
Evan Tschannen 272b4b984c fix: fixed a rare bug where we do not wait for a file in the process of being deleted to shutdown before rebooting a machine 2017-08-25 10:12:58 -07:00
A.J. Beamon 45c0585891 Merge branch 'release-5.0' 2017-08-24 14:48:47 -07:00
Alvin Moore 0c1be7537c Fixed OSX compilation warning about printf field value specification 2017-08-24 12:30:38 -07:00
Alec Grieser 2b678f6e91 Merge remote-tracking branch 'origin/release-5.0' 2017-08-23 10:24:23 -07:00
Alvin Moore 17c6392295 Added support for printing out information on the current simulation workers 2017-08-22 16:56:33 -07:00