Commit Graph

1575 Commits

Author SHA1 Message Date
Evan Tschannen e3c6b66240 fix: do not commit more data after being stopped
fix: prioritize dc locality above exclusion to prevent being stuck after excluding all machines in a data center
2018-02-26 13:13:37 -08:00
Evan Tschannen 37a6a81634 Merge commit '7f6fc3e039c911cd84b8540f7f799fc38a1c1822' into feature-remote-logs
# Conflicts:
#	fdbserver/workloads/RestartRecovery.actor.cpp
2018-02-23 12:33:28 -08:00
Alvin Moore de1551c20d Merge branch 'release-5.1' 2018-02-23 08:24:06 -08:00
Alvin Moore a1382895a6 Fixed headers and some whitespace 2018-02-23 04:50:23 -08:00
Alec Grieser e1162e9238 Merge remote-tracking branch 'upstream/release-5.1' 2018-02-22 11:16:12 -08:00
Alec Grieser 0bae9880f1 remove trailing whitespace from our copyright headers ; fixed formatting of python setup.py 2018-02-21 10:25:11 -08:00
Alec Grieser aadc06de99 Merge remote-tracking branch 'upstream/release-5.1' 2018-02-20 14:28:29 -08:00
Alec Grieser 1c1ae7d70e Merge remote-tracking branch 'upstream/release-5.1' into bindings-format 2018-02-19 12:37:06 -08:00
Evan Tschannen 31b89a638f added satellite_none and remote_none options to unconfigure from a fearless setup
fix: log_router configuration was broken
2018-02-17 13:51:17 -08:00
Evan Tschannen dc93759e15 suppressed trace events that are spammy 2018-02-16 16:01:19 -08:00
Evan Tschannen cb25564d38 simulated cluster supports fearless configurations
removed unused simulation variables
run the simulation with only 1 coordinator most of the time, since we protect the coordinator from being killed, and protecting too many things is bad for simulation
2018-02-15 18:32:39 -08:00
A.J. Beamon 814ae16016 Add destination tokens to Net2_LargePacket trace events. Add backtrace when a sent packet is too large. 2018-02-15 14:54:35 -08:00
Balachandar Namasivayam f320b1b347 Change ConnectionClosed TraceEvent severity from SevError to SevWarnAlways. 2018-02-14 12:25:54 -08:00
Stephen Atherton 0a35f167e4 Merge branch 'master' into feature-redwood
# Conflicts:
#	fdbserver/DiskQueue.actor.cpp
#	fdbserver/IDiskQueue.h
#	fdbserver/Knobs.cpp
#	fdbserver/Knobs.h
#	fdbserver/fdbserver.vcxproj
#	fdbserver/fdbserver.vcxproj.filters
#	fdbserver/worker.actor.cpp
2018-02-12 01:30:02 -08:00
Evan Tschannen 42405c78a5 Merge commit '4038bd2fd968d88861f2cebd442ce511724816cb' into feature-remote-logs
# Conflicts:
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/Knobs.cpp
2018-02-10 12:08:52 -08:00
Evan Tschannen fbadcc6eea changing a storage server’s tag must be the first mutations applied in a version, because privatized mutations applied earlier in the same version will use the old tag 2018-02-09 18:21:29 -08:00
Evan Tschannen c7b3be5b19 re-enabled better master exists
the cluster controller can choose a better data center for itself and let the workers know where the next cluster controller should be recruited
2018-02-09 16:48:55 -08:00
Stephen Atherton 69425a303b Improved error handling for cases where blob account credentials are either not found in the provided credentials sources and/or some of the credentials sources provided are not readable or parseable. 2018-02-07 21:50:43 -08:00
Stephen Atherton f8522248cb Blob credentials files were being opened in read-write mode despite the read-only option being specified because the underlying caching layer always opens files for read/write access. For now, disabled caching for this file. 2018-02-07 16:25:16 -08:00
Stephen Atherton d8879dc3f3 HTTP::doRequest() now reads responses in parallel with sending requests, so if the server responds before receiving all of the request the client can stop sending the remainder of the request. For PUT requests which upload files, this prevents sending potentially several megabytes of unnecessary bytes if the server responds with an error (such as 429) before the request is completely sent. Updated the backup container unit test to use more parallelism in order to test this new behavior. 2018-02-07 10:38:31 -08:00
Stephen Atherton 0792d5e3dd Fix: last restorable version for a backup tag name (a separate value from the latest restorable version for a configured backup) was not being updated.
Fix: backup blob speed was sometimes an error because the JSON $sum merge operator did not support mixed numeric types.
Fix: JSON merge operator handling was squashing errors in some cases, which was generally obscuring the backup speed metric issue.
Cleaned up some of the JSON object merging logic.
Improved error messages in JSON merge operators.  Added JSON merge operator tests for mixed numeric math and improved readability of test output.
2018-02-06 13:44:04 -08:00
Evan Tschannen ebd94bb654 removed a separately configurable storage team size for the remote data center, because it did not make sense
fix: the master did not monitor for the failure of remote logs
stop merge attempts when a data center is failed
fixed a variety of other problems with data distribution when a data center is failed
2018-02-02 11:46:04 -08:00
Evan Tschannen 2e3b1d7ab8 Merge commit 'dd6ea70051aef215315e9eb3dea3b67a24778e32' into feature-remote-logs
# Conflicts:
#	flow/Net2.actor.cpp
2018-01-29 17:11:03 -08:00
Stephen Atherton 2f291d8955 Bug fix in blob backup container deletion. The list/delete loop could end before deleting all of the files, but the index entry would still be deleted. Also preemptively made the same code change in listBucket() - Although it is technically correct as written it is a dangerous style because it is not obvious that the addition of a wait() call in the second 'when' block would create a bug. Consolidated deleteContainer() and deleteBucket() as they differ by only 1 line. 2018-01-29 00:32:41 -08:00
Alec Grieser 51781bb7a8 Merge branch 'release-5.1' into bindings-format 2018-01-26 12:28:29 -08:00
Evan Tschannen 79d94214a4 Merge commit 'f4ffc9752b5ec66ac47f5f684a5d8be06a7eae6e' into feature-remote-logs 2018-01-25 10:12:06 -08:00
Stephen Atherton 9fd2a8df3d Tweaked a trace event suppression time. 2018-01-24 19:08:24 -08:00
Alec Grieser 57986cfe00 format python files to be roughly pep8 compliant 2018-01-24 19:06:58 -08:00
A.J. Beamon 19ed388c0e Merge branch 'release-5.0' into release-5.1
# Conflicts:
#	documentation/sphinx/source/downloads.rst
#	documentation/sphinx/source/release-notes.rst
#	versions.target
2018-01-24 14:43:41 -08:00
Stephen Atherton 7f18d59dfe Bug fix, the blob request attempt count is now incremented for all errors except response code 429. 2018-01-24 01:15:01 -08:00
Stephen Atherton a2481343ec Bug fix, HTTP error code 429 was not being considered retryable in blob client (this was previously fixed but apparently reintroduced). 2018-01-24 00:22:11 -08:00
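The two blob client fixes above (429 is retryable, and 429 does not consume an attempt) can be sketched roughly as follows. This is an illustration only, not the project's code: `do_request`, `send`, and `max_attempts` are hypothetical names, the retryable set is assembled from status codes mentioned elsewhere in this log, and backoff sleeping is left out.

```python
from typing import Callable

# Assumption: retryable codes gathered from entries in this log (429, 502, 503).
RETRYABLE = {429, 502, 503}


def do_request(send: Callable[[], int], max_attempts: int = 3) -> int:
    """Call send() until it returns a non-retryable status or attempts run out.

    send is a hypothetical callable that performs one HTTP request and
    returns the response status code.
    """
    attempts = 0
    while True:
        status = send()
        if status not in RETRYABLE:
            return status  # success or a non-retryable error
        if status != 429:
            # Every retryable error except 429 counts against the limit.
            attempts += 1
            if attempts >= max_attempts:
                return status
        # A real client would back off here before retrying.
```

With this shape, a server that answers 429 several times in a row is retried without using up the attempt budget, while repeated 503 responses eventually surface to the caller.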
Stephen Atherton 66de9d392b New error code, http_auth_failed, which is used when blob authentication fails instead of the previous generic http_request_failed. 2018-01-22 14:58:56 -08:00
Evan Tschannen 698ef4117e Merge branch 'master' into feature-remote-logs 2018-01-20 10:34:30 -08:00
Stephen Atherton 307e04c0ad Updated backup container unit test to match new safer behavior of expireData(). Rewrote BackupContainerLocalDirectory::deleteContainer() to actually delete the whole directory but only if it appears to be a backup with either log or snapshot data. 2018-01-18 00:36:28 -08:00
Stephen Atherton 93b34a945f Major usability and performance improvements to backup management. Backup descriptions now calculate and display timestamps using TimeKeeper data (if given a cluster) and restorability of snapshots. Expire now requires a --force option to leave a backup unrestorable or unrestorable after a given point in time, specified by version or timestamp. BackupContainerFilesystem now maintains metadata on key version boundaries in order to avoid large list operations for describe and expire operations. Blob parallel recursive list operations can now take a path (aka prefix) filter function. New describe and expire options are available in fdbbackup. 2018-01-17 04:09:43 -08:00
Evan Tschannen 21482a45e1 Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbserver/DBCoreState.h
#	fdbserver/LogSystem.h
#	fdbserver/LogSystemPeekCursor.actor.cpp
#	fdbserver/TLogServer.actor.cpp
2018-01-14 13:40:24 -08:00
Alvin Moore 2e6ce03224 Merge pull request #232 from cie/build-dont-compile-hpp
Filter out .hpp files from *_BUILD_SOURCES (like we do with .h files)…
2018-01-12 14:09:25 -08:00
Evan Tschannen 02bd83ff76 changed incompatibleDataRead to an asyncTrigger 2018-01-11 13:35:56 -08:00
A.J. Beamon 80b84c23ac Filter out .hpp files from *_BUILD_SOURCES (like we do with .h files). Add xml2json.hpp to our fdbrpc project. 2018-01-10 13:51:57 -08:00
A.J. Beamon ce93d98b50 Temporarily remove xml2json.hpp from fdbrpc vcxproj 2018-01-10 10:18:44 -08:00
A.J. Beamon 2f5073d00f Some visual studio project cleanup. 2018-01-10 10:07:18 -08:00
Stephen Atherton 0e7d538c94 Bug fix, in recursive blob folder listings the recent removal of common prefixes from the result stream caused the list marker to not be set correctly when a folder level requires multiple requests due to folder size. 2018-01-06 20:58:48 -08:00
Evan Tschannen 3ec45d38a0 Merge branch 'master' into feature-remote-logs
# Conflicts:
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2018-01-06 13:54:45 -08:00
Stephen Atherton 96cb06cbc7 Bug fixes. Fdbbackup delete was broken. Blobstore backup container deletion would do too much listing before deletions began due to list operations queueing up ahead of and starving the delete operations. Created new knob and blob endpoint limit for concurrent list operations to fix this. Increased blob request timeout default because some requests were taking longer. Crash fixes in blobstore doRequest() which wasn't checking that response object is valid before using it in error conditions. Filesystem-like backup container class (covering blobstore and local dirs) now ignores unrecognized filenames for describe() and expire() operations. 2018-01-05 23:06:39 -08:00
Evan Tschannen 5ac4f73978 Merge branch 'release-5.1' into feature-remote-logs
# Conflicts:
#	fdbclient/NativeAPI.actor.cpp
#	fdbrpc/Locality.h
#	fdbrpc/simulator.h
#	fdbserver/ApplyMetadataMutation.h
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/LogSystemPeekCursor.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/SimulatedCluster.actor.cpp
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/TagPartitionedLogSystem.actor.cpp
#	fdbserver/WorkerInterface.h
#	fdbserver/masterserver.actor.cpp
#	flow/Net2.actor.cpp
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2018-01-05 11:33:42 -08:00
Stephen Atherton 78430425e8 Blob bucket listings will now use parallel recursive requests on CommonPrefixes, up to a max depth, if a delimiter is provided. 2018-01-02 23:17:52 -08:00
Stephen Atherton 07fde9dfb4 Bug fix, error code 429 was not being treated as retryable in the recent refactor. 2018-01-02 23:15:25 -08:00
Stephen Atherton f324afc13f Bug fix in blob store listing when it requires multiple serial requests. Added more trace events to FileBackup and BlobStoreEndpoint with suppression and added suppression to existing trace events. 2017-12-22 17:08:25 -08:00
Stephen Atherton f2524ffd33 AsyncFileBlobStoreWrite was prohibiting the writing of 0-byte files. Improved HTTP verbose logging to stdout. Added writing a 0-byte file to BackupContainer unit test. Added backup log and snapshot sizes to backup description. 2017-12-21 21:15:26 -08:00
Stephen Atherton e0ef5a9a20 Whitespace normalization. 2017-12-21 12:07:29 -08:00
Stephen Atherton e3aee45a74 Backup tools and agent now accept blob account credentials via files containing JSON which are specified using command line arguments and/or an environment variable. Improved fdbbackup help, clarifying which options are for which operations. Fdbbackup operations which do not need to use a database no longer require a cluster file parameter. Added eat() commands to StringRef for incrementally tokenizing strings using separator strings. 2017-12-21 01:58:15 -08:00
Stephen Atherton e0d9cea008 Merge branch 'master' into continuous-backup
# Conflicts:
#	fdbclient/FileBackupAgent.actor.cpp
#	fdbrpc/BlobStore.actor.cpp
2017-12-19 23:02:14 -08:00
Alex Miller 9a0df6d76d Deallocate aligned_alloc with aligned_free.
This probably fixes a windows-only crash, as only windows cares about this distinction.
2017-12-14 15:12:05 -08:00
Stephen Atherton b6cfe010a1 Bug fix in URL encoding of delimiter. 2017-12-12 17:31:19 -08:00
Stephen Atherton 872edd7540 Merge branch 'release-5.0'
# Conflicts:
#	fdbclient/FileBackupAgent.actor.cpp
2017-12-06 16:27:04 -08:00
Stephen Atherton 41f80bf7ed Renamed an error, changed blob request failure to Warn severity. 2017-12-06 15:58:54 -08:00
Stephen Atherton 4bc7d0b86a Updated error names and severities. 2017-12-06 15:42:44 -08:00
Stephen Atherton abb2dd1ebc Merge pull request #214 from cie/alexmiller/fallocate
Use fallocate to zero ranges instead of writing zeroes
2017-12-06 13:47:40 -08:00
Alex Miller 064670a95b Maintain a reference to the IAsyncFile in zeroRange.
And also add some notes about the reference semantics to the IAsyncFile header
for future readers.
2017-12-06 13:41:21 -08:00
Balachandar Namasivayam 1f949240f5 Make fdbbackup s3 compatible.
s3 sends response in XML. FDB backup expects json response. Added a new library xml2json to convert xml to json.
2017-12-05 17:13:15 -08:00
Stephen Atherton 86ae6c09c7 Bug fixes, take(1) is incorrect usage of FlowLock. 2017-12-04 10:20:50 -08:00
Evan Tschannen 482ac38ca6 added knobs so that the client failure monitoring update rate and the server failure monitoring update rate are separate knobs 2017-12-01 13:04:32 -08:00
Alex Miller 7bab3a4ece AsyncFileKAIO will prefer using fallocate's ZERO_RANGE for AsyncFile::zero().
For situations in which we have support for FALLOC_FL_ZERO_RANGE, it's much
faster to use fallocate than manually overwrite the file with zero bytes.  Note
that this support depends on having a kernel from late 2014 or newer, and being
on ext4 or xfs.  If these conditions aren't met, we'll fall back to writing
zeros in 1MB chunks as normal.
2017-11-30 17:57:55 -08:00
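The fast-path-with-fallback behavior described in the entry above can be sketched as follows. This is a minimal illustration under stated assumptions: `zero_range` and `try_fast` are hypothetical names, `try_fast` stands in for an fallocate call with FALLOC_FL_ZERO_RANGE (which this sketch does not make), and only the 1MB-chunk fallback is actually implemented. It relies on `os.pwrite`, so it assumes a Unix-like platform.

```python
import os
import tempfile

CHUNK = 1 << 20  # 1MB, matching the chunk size named in the commit message


def zero_range(fd: int, offset: int, length: int, try_fast=None) -> str:
    """Zero `length` bytes of `fd` starting at `offset`.

    try_fast is a hypothetical callback standing in for a kernel-assisted
    zeroing call; it returns True when it handled the range itself.
    """
    if try_fast is not None and try_fast(fd, offset, length):
        return "fast"
    # Fallback: overwrite the range with zero bytes, one chunk at a time.
    zeros = bytes(min(CHUNK, length))
    done = 0
    while done < length:
        n = min(CHUNK, length - done)
        # pwrite may write fewer bytes than requested, so advance by its result.
        done += os.pwrite(fd, zeros[:n], offset + done)
    return "fallback"
```

Keeping the fallback in the same function means callers do not need to know whether the kernel or filesystem supports the faster path.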
Alex Miller 196258080b Refactor zeroing a chunk of a file from DiskQueue into IAsyncFile.
If we're going to do the work to provide more optimized ways to zero files,
then I'd feel better with this being in a more common place, so that any other
zero-ers are likely to reuse it.  It also makes testing easier/more obvious.

Also, because it's needed for correctness, fix the aligned_alloc for OSX, which
wasn't aligned, and use an actually aligned allocation function.
2017-11-30 17:57:55 -08:00
Alex Miller c7a120c59d Rename IAsyncFile::incrementalDelete -> IAsyncFileSystem::incrementalDeleteFile.
`deleteFile` existed in IAsyncFileSystem, so an incremental delete function
seems to belong more as a virtual method on IAsyncFileSystem than a static
method on IAsyncFile, and the naming should match.

As long as we're here, change IAsyncFile to declare a virtual destructor, so
that it has good and proper C++ behavior.  I presume this is what was vaguely
intended by the default constructor definition that previously existed?
2017-11-30 17:19:10 -08:00
Stephen Atherton 1e643239f9 Improvement in blob connection reuse, oldest connections in pool are now used first. 2017-11-30 12:57:29 -08:00
Stephen Atherton 1b1c8e985a Merge branch 'master' into backup-container-refactor
# Conflicts:
#	fdbclient/FileBackupAgent.actor.cpp
2017-11-25 19:54:51 -08:00
Alex Miller f19cb3bbbd Merge pull request #208 from cie/alexmiller/grvtfix
Fix the GRV performance regression
2017-11-17 15:00:44 -08:00
Alex Miller e9412bbb11 Fix the GRV performance regression introduced by adding the policy engine to GRV calculations.
Construction of LocalityGroup from LocalityData is expensive, and the previous
code greatly ran afoul of that.  The policy engine does a large amount of
interning of strings and building compressed maps to make the expected many
future selectReplica calls cheap.  Unfortunately we don't call selectReplicas,
so much of this work is undesirable for us, and a large amount of CPU time is
spent doing this initialization work.

The new changes aggressively do the minimal LocalityGroup::add() calls
necessary, and make them as cheap as possible by removing all elements from
LocalityData that don't need to be considered by the policy.

This optimization was also applied to the PeekCursor used during recovery,
which should speed recoveries up by a small amount.
2017-11-16 16:15:52 -08:00
Stephen Atherton a77162b53d Merge branch 'master' into backup-container-refactor
# Conflicts:
#	fdbclient/BackupAgent.h
#	fdbclient/FileBackupAgent.actor.cpp
#	fdbclient/KeyBackedTypes.h
2017-11-15 08:14:47 -08:00
Stephen Atherton e07dcb9ada Fixed header paths. 2017-11-15 00:05:20 -08:00
Stephen Atherton 3dfaf13b67 IBackupContainer has been rewritten to be a logical interface for storing, reading, deleting, expiring, and querying backup data. The details of how the data is organized or stored is now hidden from users of the interface. Both the local and blobstore containers have been rewritten, the key changes being a multi level directory structure and no more use of temporary files or pseudo-symlinks in the blob store implementation. This refactor has a large impact radius as the previous backup container was just a thin wrapper that presented a single level list of files and offered no methods for managing or interpreting the file structure so all of that logic was spread around other places in the code base. This made moving to the new blob store schema very messy, and without this refactor further changes in the future would only be worse.
Several backup tasks have been cleaned up / simplified because they no longer need to manage the ‘raw’ structure of the backup.  The addition of IBackupFile and its finish() method simplified the log and range writer tasks.  Updated BlobStoreEndpoint to support now-required bucket creation and bucket listing prefix/delimiter options for finding common prefixes.  Added KeyBackedSet<T> type.  Moved JSONDoc to its own header.  Added platform::findFilesRecursively().

Still to do:  update command line tool to use new IBackupContainer interface, fix bugs in Restore startup.
2017-11-14 23:33:17 -08:00
Balachandar Namasivayam 987379d790 Changed naming of num_incompatible_connections to numIncompatibleConnections 2017-11-14 18:37:29 -08:00
Balachandar Namasivayam 27b67cffbe The earlier implementation of tracking number of incompatible connection had a bug where the counter will be incorrectly decremented for incoming connections on certain conditions.
Now the counter increment and decrement happen in the same ACTOR (ConnectionReader), which makes it easy to verify its correctness.
2017-11-13 15:07:39 -08:00
Balachandar Namasivayam 9809e84806 Added a counter to keep track of active outgoing incompatible connections.
This counter is used to print a warning in fdbcli if there are incompatible peers.

Example Output:

./fdbcli
Using cluster file `fdb.cluster'.

WARNING: Incompatible peers exist.

The database is unavailable; type `status' for more information.

Welcome to the fdbcli. For help, type `help'.
fdb> status

WARNING: Incompatible peers exist.

Using cluster file `fdb.cluster'.

Could not communicate with a quorum of coordination servers:
  127.0.0.1:4000  (unreachable)
2017-11-09 11:20:35 -08:00
Evan Tschannen 57aba0b3bc fix: excluded servers were the same fitness as storage servers for the master role
fix: better master exists did not consider exclusion for master fitness
2017-11-03 17:09:14 -07:00
John Brownlee d46e240de2 Merge branch 'release-5.0'
# Conflicts:
#	fdbclient/FileBackupAgent.actor.cpp
#	versions.target
2017-11-02 10:42:30 -07:00
Stephen Atherton f050105243 Added HTTP 502 to the list of retryable errors. 2017-11-01 11:41:32 -07:00
Alex Miller 3b61b76876 Fix a massive amount of valgrind errors and make them easier to debug in the future.
std::is_pod<> being less restrictive than is_binary_serializable<> meant that
structs that both were POD and had a serialize method defined would be binary
serialized instead of using the defined serialize().  This means that it would
also serialize any padding that the struct contained, which would cause mass
waves of valgrind failures from uninitialized memory.

Included in this change is additional uses of valgrind client requests so that
attempts to send uninitialized memory are reported at the sending site, versus
as part of checksum calculation in sending the packet.
2017-10-27 16:54:44 -07:00
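The padding issue described in the entry above can be illustrated outside C++ with a small sketch. The struct layout and function names here are hypothetical. Note one difference from the C++ case: ctypes zero-initializes the struct's memory, so the padding bytes here are deterministic rather than uninitialized; the point is only that a raw copy carries bytes that a field-by-field serialize does not.

```python
import ctypes
import struct


class Sample(ctypes.Structure):
    # Hypothetical struct: a 1-byte field followed by a 4-byte field.
    # On common ABIs the compiler inserts padding after `flag`.
    _fields_ = [("flag", ctypes.c_uint8), ("value", ctypes.c_uint32)]


def serialize_raw(s: Sample) -> bytes:
    """Copy the struct's memory as-is, padding included."""
    return bytes(s)


def serialize_fields(s: Sample) -> bytes:
    """Write each field explicitly; '<' disables alignment padding."""
    return struct.pack("<BI", s.flag, s.value)
```

On typical platforms the raw form is larger than the field-wise form, and the extra bytes are exactly the padding that a tool like valgrind would flag if it were never written.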
Evan Tschannen df74e2a373 re-added support for non-copying tlog recovery 2017-10-24 15:09:31 -07:00
Stephen Atherton 45fa3680fa Restore logging of remote address (if connected) or host (if connection fails) for blob errors. 2017-10-20 21:47:23 -07:00
Stephen Atherton 3afc85881e Merge branch 'master' into backup-container-refactor
# Conflicts:
#	fdbrpc/BlobStore.actor.cpp
2017-10-20 21:38:28 -07:00
Stephen Atherton 42955012e9 Merge branch 'release-5.0'
# Conflicts:
#	fdbrpc/BlobStore.actor.cpp
#	flow/error_definitions.h
2017-10-20 21:16:55 -07:00
Stephen Atherton 9f151314b3 Changed some trace event severities. Also fixed a weird casing of “retryable”. 2017-10-19 17:47:42 -07:00
Evan Tschannen e2c1e87df6 made a large number of fixes to make fearless DR correctness clean. 2017-10-19 15:36:32 -07:00
Stephen Atherton caad691ae2 Added comments for how to handle HTTP 400 errors gracefully in certain instances should the need arise. 2017-10-18 23:47:59 -07:00
Stephen Atherton ef84e52127 Improved error handling and memory usage in AsyncFileBlobStoreWrite. Writes will now fail if any upload has already failed, rather than buffering unboundedly until sync() is called to complete the file. There is also a configurable limit on how many uploads can be pending before writes will stall waiting for one to finish. 2017-10-18 05:51:30 -07:00
Stephen Atherton ebd0234514 Rewrote most error handling in BlobStoreEndpoint to fix several shortcomings in error handling and logging. The request loop now logs but rate limits all errors, and the exceptions thrown are more appropriate. HTTP 503 is now treated as retryable. Callers of BlobStoreEndpoint::doRequest() now specify which codes they consider to be successful so that more error handling can take place in the main request loop. 2017-10-18 02:52:09 -07:00
Alex Miller 7b9bc1d715 Merge pull request #170 from cie/alexmiller/flowprofile
Add support for profiling a running fdb cluster to fdbcli, fix security issues, and add an improved backtrace.
2017-10-16 16:51:53 -07:00
Alex Miller cf646d4a99 Address review comments.
* Fixed fdbcli to be more idiomatic.
* Removed is_binary_serializable in favor of std::is_pod<>
* Removed custom enable_if<> in favor of std::enable_if<>
* Removed HEY REVIEWER comments
* Removed print from prof.py
* Added FLOW_PROFILER_ENABLED=yes to circus components that wished to enable the flow profiler.
2017-10-16 16:46:52 -07:00
Yichi Chiang a6ae89af1a Merge pull request #176 from cie/add-cluster-controller-process-class
Add cluster controller process class
2017-10-16 16:27:54 -07:00
Yichi Chiang af2aa41136 Downgrade Transaction process class for cluster controller 2017-10-16 16:27:01 -07:00
Yichi Chiang 76c5488421 Add cluster controller process class 2017-10-16 16:21:25 -07:00
Stephen Atherton e934604f67 Added DNS resolution. Interface is INetworkConnections::resolveTCPEndpoint() to resolve, or for convenience INetworkConnections::connect(host, service) will resolve host and service (port number or service name like http) and connect to one of the addresses at random.
BlobStoreEndpoint now only accepts hostnames and an optional service, so this update is not compatible with the previous URL formats having many IP addresses.
2017-10-15 21:51:11 -07:00
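The resolve-then-connect-to-a-random-address behavior described in the entry above can be sketched with the standard library. This is an assumption-laden illustration, not the INetworkConnections interface: `resolve_tcp_endpoint` and `pick_address` are hypothetical Python names, and the `getaddrinfo` parameter exists only so the resolver can be swapped out.

```python
import random
import socket


def resolve_tcp_endpoint(host, service, getaddrinfo=socket.getaddrinfo):
    """Resolve host and service (port number or name like 'http') to addresses."""
    infos = getaddrinfo(host, service, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); keep the
    # (ip, port) part of sockaddr and drop duplicates.
    return sorted({info[4][:2] for info in infos})


def pick_address(host, service, rng=random, getaddrinfo=socket.getaddrinfo):
    """Resolve, then choose one of the resulting addresses at random."""
    addrs = resolve_tcp_endpoint(host, service, getaddrinfo=getaddrinfo)
    if not addrs:
        raise OSError(f"no addresses found for {host}:{service}")
    return rng.choice(addrs)
```

A caller would pass the chosen `(ip, port)` pair to its own connect routine; picking at random on each call spreads connections across the addresses behind a single hostname.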
Evan Tschannen ff1b49be2e Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbserver/DatabaseConfiguration.cpp
2017-10-10 16:07:59 -07:00
Evan Tschannen 15962cf079 Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbrpc/Locality.cpp
#	fdbrpc/Locality.h
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/ClusterRecruitmentInterface.h
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/TagPartitionedLogSystem.actor.cpp
#	fdbserver/WorkerInterface.h
#	fdbserver/fdbserver.vcxproj.filters
#	fdbserver/masterserver.actor.cpp
#	fdbserver/worker.actor.cpp
#	flow/error_definitions.h
2017-10-05 17:09:44 -07:00
Alvin Moore de8f875038 Fixed call to IsClear
Changed killMachine and killDataCenter interface to return final killtype
Updated TESTs for DataCenter to ensure that DataCenter was killed
Added assertion to ensure that failed DC kills were not downgrades
2017-10-05 03:07:20 -07:00
Stephen Atherton fd5fe3a000 Add slightly better handling of HTTP 503 in blob client. Previously it would end the blob request loop and the task doing the blob action would see a failure, but now the blob request attempt loop will continue to back off and retry. This is better because previously the task that saw the failure would be re-run quickly. 2017-10-03 15:25:49 -07:00
Stephen Atherton 03c4cea511 Added rate-controlled TraceEvents for blob http connection attempts and failures. 2017-10-03 15:21:40 -07:00
Yichi Chiang 284e35204a Fix connection count 2017-10-03 10:54:20 -07:00
Alvin Moore 5257b99d3f Fixed problem with machines RebootedAndCleared not being considered dead in availability consideration 2017-10-03 10:48:16 -07:00
Alvin Moore d099656557 Merge branch 'release-5.0' 2017-10-02 12:05:24 -07:00
Alvin Moore 25513d8e2c Added tests for DataCenter kills 2017-10-02 12:04:28 -07:00
Evan Tschannen 6ea9903c82 Merge branch 'release-5.0'
# Conflicts:
#	fdbbackup/backup.actor.cpp
#	fdbserver/ClusterController.actor.cpp
#	versions.target
2017-10-01 18:46:44 -07:00
Stephen Atherton 058300be16 Each blobstore request will again select a random remote address. This used to happen before recent load balancing improvements related to focusing too much load on consistently up endpoints after others have recovered from being down. 2017-10-01 16:17:38 -07:00
Stephen Atherton a95107417f Improved behavior of slow writes during backup. KeyRange and Log backup tasks now use TaskBucket::saveAndExtend() to keep the task alive until flushing the file finishes or fails with an error (blob uploads fail after a limited number of retries). This prevents blob uploads from being retried too often if the destination is slow since a task abort and retry would start the backoff counters back at zero. Also removed a debugging behavior that was accidentally checked in. 2017-10-01 16:01:24 -07:00
Stephen Atherton a098919b20 Bug fix, releaser declared in wrong place, and lots of whitespace cleanup from try blocks that were no longer needed. 2017-10-01 11:25:50 -07:00
Stephen Atherton af87ac301d Removed wait never used for debugging which was accidentally included in bug fix. 2017-10-01 11:19:38 -07:00
Stephen Atherton 6000cafde1 Bug fix, locks were being taken inside try/catch so release would be done even if the take threw an error. Changed to using a Releaser. 2017-10-01 10:46:55 -07:00
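The bug described in the entry above (a release running even though the take failed) can be sketched as follows. This is an illustration under assumptions: `SketchLock` is a hypothetical counting lock whose `take()` raises instead of waiting, and `releaser` is a Python stand-in for the Releaser idea, not the FlowLock API.

```python
from contextlib import contextmanager


class SketchLock:
    """Hypothetical counting lock; take() raises instead of waiting."""

    def __init__(self, permits: int) -> None:
        self.available = permits

    def take(self, n: int = 1) -> None:
        if n > self.available:
            raise RuntimeError("take failed")
        self.available -= n

    def release(self, n: int = 1) -> None:
        self.available += n


def buggy(lock: SketchLock, n: int) -> bool:
    """Take inside the try block: release runs even when take() raised."""
    try:
        lock.take(n)
        return True
    except RuntimeError:
        return False
    finally:
        lock.release(n)


@contextmanager
def releaser(lock: SketchLock, n: int = 1):
    lock.take(n)  # if this raises, nothing is released
    try:
        yield
    finally:
        lock.release(n)


def fixed(lock: SketchLock, n: int) -> bool:
    """Release is tied to a successful take via the releaser."""
    try:
        with releaser(lock, n):
            return True
    except RuntimeError:
        return False
```

In the buggy version a failed take still returns permits, so the lock ends up with more permits than it started with; in the fixed version the permit count is unchanged after a failed take.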
Evan Tschannen f84e7252e8 fix: there was a reference counting cycle in asyncFileBlobStore and asyncFileReadAhead 2017-09-29 19:13:08 -07:00
A.J. Beamon 38616424f6 Report a couple error cases in blobstore URL parsing when dealing with numbers. 2017-09-29 17:58:49 -07:00
Alex Miller c40c1bb5fe Add a new workload: BackupToDBAbort, which does an ACI switchover.
This is to allow easier testing of non-durable switchovers without having to
wiggle into BackupToDBCorrectness's view of the world.
2017-09-29 15:58:36 -07:00
Evan Tschannen a1f8b546e6 fix: ensure connections to blob store are evenly distributed across network addresses
added a per address limit to the number of open connections
lowered a variety of knobs to prevent us from using too much memory
2017-09-29 14:59:24 -07:00
A.J. Beamon d30c730f75 Add the ability to access name and description in Error. Update error descriptions. 2017-09-28 12:35:03 -07:00
Alvin Moore 298b54104e Merge branch 'release-5.0' 2017-09-26 11:16:14 -07:00
Alvin Moore 02525d7b14 Added TESTs to ensure that all of the different kills are performed during simulation 2017-09-26 11:15:39 -07:00
Stephen Atherton 1ca9814879 Bug (arguable, perhaps) fix in AsyncFileCached. Order was not being enforced between writes and truncates such that calling and waiting on a truncate to X and then writing to X + 1 could end up writing first and then truncating the written page off of the file. 2017-09-20 17:58:56 -07:00
Evan Tschannen e8b895c878 added the ability to disable connection failures for a period of time after one happens 2017-09-18 12:46:29 -07:00
Evan Tschannen 8cb53fd608 Merge pull request #149 from cie/choose-leader-on-stateless-processes
choose leader on the preferred process class
2017-09-13 13:58:49 -07:00
Alvin Moore b1dd2ac6fe Merge branch 'release-5.0' 2017-09-12 13:34:28 -07:00
Alvin Moore 4a6fb10a42 Added TraceEvents for remaining and killed workers when killing DataCenter
Fixed consideration of excluded workers when checking cluster availability
2017-09-12 13:33:13 -07:00
Evan Tschannen 76e7988663 Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/OldTLogServer.actor.cpp
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/WorkerInterface.h
#	flow/Net2.actor.cpp
2017-09-11 15:15:56 -07:00
Evan Tschannen ea26bc1c43 passed first tests which kill entire datacenters
added configuration options for the remote data center and satellite data centers
updated cluster controller recruitment logic
refactors how master writes core state
updated log recovery, and log system peeking
2017-09-07 15:32:08 -07:00
Evan Tschannen 6f6dbe4b33 fix: load balance will still use second requests when client locality is present 2017-09-01 11:14:18 -07:00
Alvin Moore 0994587573 Fixed OS X compilation build warnings due to printf field specifiers 2017-09-01 09:35:56 -07:00
Alvin Moore fd439e9d1c Fixed OS X compilation build warnings due to printf field type specifiers 2017-09-01 09:34:53 -07:00
Stephen Atherton 6e9de8f35a Bug fix. eraseDirectoryRecursive() on MacOS used to do nothing at all, but now it erases directories recursively. The Linux version was modified to be simpler and use a version of the FTW API that also works on MacOS. 2017-08-31 00:11:18 -07:00
A.J. Beamon 9a0a3b6329 Merge commit '66528becb82d826e81fa644bb378212584ab580e' 2017-08-28 16:47:59 -07:00
Yichi Chiang 9fe927127f choose leader on the preferred process class 2017-08-28 14:41:04 -07:00
Alvin Moore 44e0df78c5 Added support for tracking roles for simulation workers
Fixed the exclusion and inclusion address simulation API and integration within workloads
Added more information within trace events for simulation
2017-08-28 11:25:37 -07:00
Alec Grieser 300b5a17ed Merge branch 'release-5.0' 2017-08-25 18:55:33 -07:00
Evan Tschannen 272b4b984c fix: fixed a rare bug where we do not wait for a file in the process of being deleted to shutdown before rebooting a machine 2017-08-25 10:12:58 -07:00
A.J. Beamon 45c0585891 Merge branch 'release-5.0' 2017-08-24 14:48:47 -07:00
Alvin Moore 0c1be7537c Fixed OSX compilation warning about printf field value specification 2017-08-24 12:30:38 -07:00
Alec Grieser 2b678f6e91 Merge remote-tracking branch 'origin/release-5.0' 2017-08-23 10:24:23 -07:00
Alvin Moore 17c6392295 Added support for printing out information on the current simulation workers 2017-08-22 16:56:33 -07:00
A.J. Beamon 41c90bcdea Merge commit '89ac94853c70d08289e7fb58055bc5d0cd4e494d' 2017-07-26 15:35:36 -07:00
A.J. Beamon 311d0e3815 Remove outdated comment from incrementalDelete function. 2017-07-26 15:27:37 -07:00
A.J. Beamon d8acb11200 Remove the change that waits only for unlinking; call delete on the file even if it doesn't exist. 2017-07-26 15:25:49 -07:00
A.J. Beamon d8e308c18f Enable use of incremental delete when deleting disk queue and sqlite KVS sqlite files. 2017-07-26 14:11:11 -07:00
Evan Tschannen 64e9560599 Merge pull request #128 from cie/maintain-incompatible-connections
Maintain incompatible connections
2017-07-17 16:28:22 -07:00
A.J. Beamon 2113d47db6 Update protocol version for incompatible connection change 2017-07-17 16:16:05 -07:00
A.J. Beamon 23c2946fa3 Rename some trace events surrounding connections 2017-07-17 16:15:18 -07:00
A.J. Beamon 591d98f711 Update the incompatible version behavior change protocol version check and add a note that we'll need to appropriately set the version at merge time. 2017-07-17 11:00:45 -07:00
A.J. Beamon 650c6ff399 Merge branch 'release-5.0' into maintain-incompatible-connections 2017-07-17 10:40:36 -07:00
A.J. Beamon 9493f8f78c Merge branch 'release-5.0' 2017-07-17 09:34:37 -07:00
A.J. Beamon a7fbc56a8e Checksums computed on pages with partially undefined contents are still valid, so mark them as such for valgrind purposes. 2017-07-17 09:34:04 -07:00
Alec Grieser f75b6f333b Merge branch 'release-5.0' 2017-07-13 11:21:18 -07:00
Stephen Atherton 39ff1b3c52 Bug fix, when io_timeouts are enabled in warn only mode they weren’t being logged at all. 2017-07-05 14:43:10 -07:00
Stephen Atherton 1b1a0d27e2 Merge branch 'release-5.0'
# Conflicts:
#	versions.target
2017-06-29 15:58:04 -07:00
Stephen Atherton 028fb75f88 Added last write timestamp to lost write detector class. Renamed TraceEvent for lost writes detected since it is no longer part of the KAIO class specifically. 2017-06-29 15:11:11 -07:00
Alec Grieser 9bcdfe4ddb removed undefined behavior surrounding TLS logging 2017-06-28 14:23:53 -07:00
Alec Grieser 94bce335e7 Merge branch 'release-5.0' 2017-06-19 17:51:10 -07:00
Alvin Moore 6d19580789 Merge branch 'release-5.0' of github.com:apple/foundationdb into release-5.0
# Conflicts:
#	fdbrpc/simulator.h
2017-06-19 17:39:37 -07:00
Alvin Moore 9553458b78 Updated simulation to support managing exclusion and inclusion address
Added method for identifying acceptable availability process classes
Extended cluster availability function to ensure coordinators can be auto configured
Fixed availability function to allow protected processes to be considered as dead if not available
Added debug trace events for providing machine state when considering availability
Added trace event for protected coordinators
2017-06-19 16:48:15 -07:00
Stephen Atherton 5d13d845a7 Merge branch 'release-5.0' 2017-06-18 23:25:29 -07:00
Stephen Atherton 0e638e7ea2 Merge branch 'release-4.6' into release-5.0 2017-06-18 23:25:17 -07:00
Stephen Atherton 6d9e302487 Merge branch 'release-5.0' 2017-06-16 02:14:34 -07:00
Stephen Atherton 430bb6224e Merge branch 'release-4.6' into release-5.0
# Conflicts:
#	fdbrpc/AsyncFileKAIO.actor.h
#	fdbrpc/Net2FileSystem.cpp
#	fdbrpc/sim2.actor.cpp
2017-06-16 02:14:19 -07:00
Stephen Atherton 1c94e30e64 Merge branch 'release-5.0' 2017-06-15 17:40:40 -07:00
Stephen Atherton f405c8d88e Merge branch 'release-4.6' into release-5.0
# Conflicts:
#	fdbrpc/AsyncFileKAIO.actor.h
#	fdbrpc/sim2.actor.cpp
#	fdbserver/optimisttest.actor.cpp
#	versions.target
2017-06-15 17:40:19 -07:00
Evan Tschannen cdd64ebc15 fix: asyncFileNonDurable could never complete deleting a file in rare situations 2017-06-15 13:30:15 -07:00
Evan Tschannen afdc125db9 Merge branch 'release-5.0' 2017-06-14 16:44:23 -07:00
Evan Tschannen 4bdcd8fc12 Merge branch 'release-4.6' into release-5.0
# Conflicts:
#	bindings/bindingtester/run_binding_tester.sh
#	fdbrpc/AsyncFileKAIO.actor.h
2017-06-14 16:43:53 -07:00
Yichi Chiang 02ee6d8cd1 Change checksum enabled condition 2017-06-13 11:03:25 -07:00
Stephen Atherton e318aabe55 Merge branch 'release-5.0' 2017-05-31 17:10:48 -07:00
Stephen Atherton fa4fdb1f1d Merge branch 'fix-io-timeout-handling' into release-5.0
# Conflicts:
#	fdbserver/optimisttest.actor.cpp
2017-05-31 17:03:15 -07:00
Yichi Chiang 41d9bce2d7 Merge pull request #115 from cie/checksum-off-with-tls
Disable checksum when TLS is enabled
2017-05-30 11:43:53 -07:00
Stephen Atherton 98604d33a0 Merge branch 'fix-io-timeout-handling'
# Conflicts:
#	fdbrpc/AsyncFileKAIO.actor.h
#	fdbrpc/sim2.actor.cpp
#	fdbserver/KeyValueStoreSQLite.actor.cpp
#	fdbserver/optimisttest.actor.cpp
#	fdbserver/worker.actor.cpp
#	fdbserver/workloads/MachineAttrition.actor.cpp
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2017-05-26 18:43:08 -07:00
Stephen Atherton 7260e38545 Merge branch 'fix-io-timeout-handling'
# Conflicts:
#	fdbrpc/AsyncFileKAIO.actor.h
#	fdbrpc/sim2.actor.cpp
#	fdbserver/KeyValueStoreSQLite.actor.cpp
#	fdbserver/optimisttest.actor.cpp
#	fdbserver/worker.actor.cpp
#	fdbserver/workloads/MachineAttrition.actor.cpp
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2017-05-26 17:43:28 -07:00
Yichi Chiang d2ad46680c Disable checksum when TLS is enabled 2017-05-26 15:34:40 -07:00
Alvin Moore b28ed397a2 Fixed printf field width specifier to reduce compilation warnings within OS X 2017-05-26 14:51:34 -07:00
Alvin Moore 0b9ed67e12 Fixed support for RemoveServers Workload
Added availability functions to simulation
2017-05-26 14:20:11 -07:00
Alvin Moore 16cc0821b1 Removed dead machine option from simulation 2017-05-25 16:29:02 -07:00
FDB Dev Team a674cb4ef4 Initial repository commit 2017-05-25 13:48:44 -07:00