Commit Graph

1366 Commits

Author SHA1 Message Date
Xin Dong 5967ef5eab Added back the changes that report trace log flush failures and fix the random crash 2020-03-12 14:34:19 -07:00
A.J. Beamon 2466749648 Don't disallow allocation tracking when a trace event is open because we now have state trace events. Instead, only block allocation tracking while we are in the middle of allocation tracking already to prevent recursion. 2020-03-12 11:17:49 -07:00
A.J. Beamon 8cdf918316 Add logging when file identifiers don't match 2020-03-12 11:06:53 -07:00
Andrew Noyes 770ef6e726 Add test 2020-03-10 10:42:57 -07:00
Andrew Noyes 027029cc9b Remove offending overload? 2020-03-10 10:18:14 -07:00
Evan Tschannen 303df197cf Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	bindings/c/test/mako/mako.c
#	documentation/sphinx/source/release-notes.rst
#	fdbbackup/backup.actor.cpp
#	fdbclient/NativeAPI.actor.cpp
#	fdbclient/NativeAPI.actor.h
#	fdbserver/DataDistributionQueue.actor.cpp
#	fdbserver/Knobs.cpp
#	fdbserver/Knobs.h
#	fdbserver/LogRouter.actor.cpp
#	fdbserver/SkipList.cpp
#	fdbserver/fdbserver.actor.cpp
#	flow/CMakeLists.txt
#	flow/Knobs.cpp
#	flow/Knobs.h
#	flow/flow.vcxproj
#	flow/flow.vcxproj.filters
#	versions.target
2020-03-06 18:22:46 -08:00
tclinken 2017daf7d4 Ignore createDirectory error if directory already exists 2020-03-06 16:48:23 -08:00
Evan Tschannen dbfc0cbcc0
Merge pull request #2781 from alexmiller-apple/certificate-refresh
Refresh certificates used for handshaking when they change on disk
2020-03-06 11:12:04 -08:00
Alex Miller f9969a853c Merge remote-tracking branch 'origin/certificate-refresh' into certificate-refresh 2020-03-06 11:10:05 -08:00
Alex Miller 188d9b8239 Don't swallow actor cancellation in certificate refreshing. 2020-03-06 11:09:17 -08:00
Alex Miller 9b760fae2d Rewrite all Errors into tls_errors if they happen as part of initializing TLS. 2020-03-06 11:06:19 -08:00
Alex Miller 1f56bf8933
Fix the build with success()
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2020-03-06 10:15:04 -08:00
Alex Miller ac52b6b474 Rework a bit of error and exception handling.
I went back and dug through all of the "what functions can throw what
types", and made sane decisions about them.  boost errors are
aggressively translated into FDB ones, whcih might result in multiple
lines of logging about errors, but this is in infrequently run code, so
it should be fine.
2020-03-06 02:33:16 -08:00
Evan Tschannen 39050308ff lower accept batch size just to be conservative with the change 2020-03-05 18:17:49 -08:00
Evan Tschannen 1128666840 added additional logging on the log router 2020-03-05 18:17:06 -08:00
Alex Miller ccef3f7d05 Attempt to fix TLS_DISABLED compiles. 2020-03-05 17:32:10 -08:00
Alex Miller 2d95a1e64d Implement certificate refreshing 2020-03-05 17:25:33 -08:00
Alex Miller 595dd77ed1 Merge remote-tracking branch 'upstream/release-6.2' into certificate-refresh 2020-03-04 20:25:42 -08:00
Alex Miller 9b5ef3416e Refactor TLSParams into TLSConfig + LoadedTLSConfig
The idea being that we keep around a TLSConfig that the configuration
that the user has provided, and then when we want to intialize an SSL
context, we ask the TLSConfig to load all certificates and return us a
LoadedTLSConfig that is a concrete set of certificate bytes in memory.

initTLS now just takes the in-memory bytes and applies them to the ssl
context.

This is a large refactor to lead up into certificate refeshing, where we
will periodically check for changes to the certificates, and then
re-load them and apply them to a new SSL context.
2020-03-04 20:14:47 -08:00
Xin Dong 39610d15f8 Revert this change since it somehow introduced a random crash detected on circus 2020-03-04 16:14:38 -08:00
Evan Tschannen 2a877bce9a
Merge pull request #2777 from etschannen/feature-accept-batch
Accept connections in batches of 20 to improve performance
2020-03-04 16:14:24 -08:00
Evan Tschannen c73cae0feb
Merge pull request #2760 from ajbeamon/client-version-fixes
Improvements to client version reporting
2020-03-04 15:52:49 -08:00
A.J. Beamon b3c3f8aa5f
Update flow/genericactors.actor.h
Pass by reference
2020-03-04 15:35:51 -08:00
Evan Tschannen 7cbabca124 remove printing to stderr from initTLS because that could cause problems on clients 2020-03-04 15:06:22 -08:00
Evan Tschannen 35a1ac6482 prepare net2 for new versions of boost 2020-03-04 14:26:01 -08:00
Evan Tschannen da579faf62 add missing task priority 2020-03-04 14:25:30 -08:00
Evan Tschannen 820957025f accept connections in batches of 20 to improve performance 2020-03-04 14:24:57 -08:00
Andrew Noyes 24bbf5a8f0 Avoid invalid read on invalid Void msg 2020-03-02 12:11:43 -08:00
Andrew Noyes cdbe3117d7 Fix typo 2020-03-02 12:11:43 -08:00
Andrew Noyes 7119b46eb2 Add unit test 2020-03-02 12:11:43 -08:00
Evan Tschannen c11c24b79d removed the fdbrpc version of platform.h 2020-02-28 14:56:10 -08:00
Andrew Noyes e6d36a0aa5 Fix Makefile build 2020-02-28 13:16:58 -08:00
Andrew Noyes f29d6c3f67 Move implementation of ArenaBlock members to Arena.cpp 2020-02-28 12:33:57 -08:00
Evan Tschannen 6054c05963 Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	documentation/sphinx/source/release-notes.rst
#	fdbserver/fdbserver.actor.cpp
#	versions.target
2020-02-28 12:11:05 -08:00
A.J. Beamon d1e1fea42d Our binaries that act like clients (fdbcli, backup and DR binaries) were reporting an unknown client version. Clients did not react if the list of supported versions changed. 2020-02-28 09:35:21 -08:00
Xin Dong 13e72f7b3b
Merge pull request #2605 from dongxinEric/fix/1977/report-inability-to-flush-trace-log
Report inability to flush trace logs.
2020-02-27 12:36:55 -08:00
Xin Dong 16575ae94d Address review comments 2020-02-27 11:54:15 -08:00
Xin Dong 4ac7b36e44 Added back the mutex holder that was removed accidentally 2020-02-27 10:19:17 -08:00
Evan Tschannen 707fc1ddea only capture the policy to match prior code 2020-02-26 19:04:49 -08:00
Evan Tschannen c3299b8ebe if tls cannot be initialized, throw an error from createDatabase 2020-02-26 18:53:06 -08:00
Evan Tschannen bf5a95e6df Merge commit 'dc39bdfbbf94a7f470386f439df08c044d08d90c' into feature-tls-environment-vars
# Conflicts:
#	flow/Net2.actor.cpp
2020-02-26 18:02:56 -08:00
Evan Tschannen f035bed870 defer initializing TLS to avoid throwing errors from a constructor and so that errors can be logged to the trace file 2020-02-26 17:50:07 -08:00
A.J. Beamon 4bbac9d996 Change a special case return to -1. Update comments to clarify and correct some things. 2020-02-26 16:39:13 -08:00
Evan Tschannen f85af10a18 fixed a few problems with tls setup 2020-02-26 16:06:45 -08:00
Evan Tschannen d1598e7c99 set_verify_peers throws an error instead of returning a value 2020-02-26 16:06:16 -08:00
Evan Tschannen 2586bade68 re-added support for configuration TLS options with environment variables 2020-02-26 15:33:48 -08:00
A.J. Beamon 0f5c999d4b Better containment of boost errors related to TLS. 2020-02-26 12:26:43 -08:00
Steve Atherton 087c6fa33d
Merge branch 'master' into feature-redwood 2020-02-26 12:25:04 -08:00
Xin Dong 74c929d98d Fix windows build, again 2020-02-26 10:01:08 -08:00
Evan Tschannen 924d335aa7 Merge branch 'release-6.2'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	flow/Knobs.cpp
#	flow/Knobs.h
2020-02-25 18:25:19 -08:00
Xin Dong 7b51ab6b63 Rebased with master 2020-02-25 15:43:33 -08:00
Xin Dong f20619c9fb Resolve review comments. Changed how issues got cleared 2020-02-25 15:39:51 -08:00
Xin Dong 3f24ae93f2 Remove the unused variable 2020-02-25 15:39:38 -08:00
Xin Dong 090c89e90a Addressed review comments. Fix the bug where issues on a worker may be wrongly cleared by subsequent GetDBinfo request. 2020-02-25 15:39:38 -08:00
Xin Dong aaa63331b6 Fix windows build 2020-02-25 15:39:09 -08:00
Xin Dong 288e95c7e1 Reallocate the issues set after each get. Changed an issues name to be accurate 2020-02-25 15:39:09 -08:00
Xin Dong 1c346fcfb0 Added the new issues into Status Schema. Remove the issue reporting in lastError since:
- If the issue string contains the error number, status schema needs to be super verbose to include all possible issue strings
- If the issue string does not contain the error number, the generic issue string can be pretty useless.

Thus now specific issues are being reported before calling lastError
2020-02-25 15:38:14 -08:00
Xin Dong 39c92c9cce Update flow/FileTraceLogWriter.cpp
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2020-02-25 15:38:14 -08:00
Xin Dong f4f860bfa8 Changed issue reporting to be thread safe. Also changed the liveness ping to be thread safe. 2020-02-25 15:38:14 -08:00
Xin Dong a6580dc15f Added the ability to ping a trace log writer thread and the monitoring in worker.actor.cpp. The current solution is simple a loose check. We can change this to be accurate check by using 'pthread_kill(writer_thread, 0)' 2020-02-25 15:37:53 -08:00
Xin Dong 0b0414fb94 Addressded review comments. Change the issue reporting from 'ITraceLogWriter' to be a more generic way. 2020-02-25 15:37:53 -08:00
Xin Dong 034dfe5e42 Now the inability to flush trace logs will be reported to both 'stderr' and also the status json object.
- Since the first flush failure, if the accumulated consecutive failure count exceeds the value defined in knobs, it will trigger the current worker process to report this issue via the 'GetServerDBInfo' interface of the cluster controler
    - A successful flush will reset the accumulated counter.
    Notice that the current solution does not take the time into consideration. The assumption is that flush failures tend to only happen in a clustered manner. The intermittent, but short, periods of flush failures are not considered as a problem since the memory pressure built by them should be negligible.
2020-02-25 15:37:32 -08:00
A.J. Beamon 0f7656e52e Document roughness. Remove an unexplained factor of 2 and handle window edges better. Subtract 1 from roughness to correspond better to variance. 2020-02-25 08:45:51 -08:00
A.J. Beamon 1c6aef76b5 When one of the sqlite reader or writer thread pools fail, fail the other with the same error. 2020-02-24 12:39:04 -08:00
Alvin Moore 9585cd10f1 Removed duplicate CMake link request 2020-02-24 00:19:43 -08:00
Alvin Moore 0f64505d0b Merge branch 'release-6.2' of github.com:apple/foundationdb
Needed to pull in changes to build docker
2020-02-23 23:27:53 -08:00
Steve Atherton 712aa27896 Merge branch 'release-6.2' of github.com:apple/foundationdb into feature-redwood 2020-02-23 00:30:27 -08:00
Evan Tschannen 65fbe0d0bc revert AcceptSocket priority change because of bad performance results 2020-02-21 19:22:14 -08:00
Evan Tschannen 96258b9809 Merge branch 'release-6.2'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbcli/fdbcli.actor.cpp
#	fdbclient/ManagementAPI.actor.cpp
#	fdbrpc/FlowTransport.actor.cpp
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/DataDistribution.actor.h
#	fdbserver/DataDistributionQueue.actor.cpp
#	fdbserver/KeyValueStoreMemory.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/QuietDatabase.actor.cpp
#	fdbserver/SkipList.cpp
#	fdbserver/StorageMetrics.actor.h
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/fdbserver.actor.cpp
#	fdbserver/storageserver.actor.cpp
#	fdbserver/workloads/KVStoreTest.actor.cpp
#	flow/CMakeLists.txt
#	flow/Knobs.cpp
#	flow/Knobs.h
#	flow/genericactors.actor.cpp
#	flow/serialize.h
2020-02-21 19:09:16 -08:00
Steve Atherton f1ec780b31 Merge branch 'release-6.2' of github.com:apple/foundationdb into feature-redwood 2020-02-21 17:43:11 -08:00
A.J. Beamon 4c696d5bf2 Merge branch 'release-6.2' into dd-better-rebalance-logging
# Conflicts:
#	fdbserver/DataDistributionQueue.actor.cpp
2020-02-21 17:41:00 -08:00
A.J. Beamon dfa5f76c01 Remove unused parameter. Don't put check for g_network presence in ASSERT_WE_THINK. 2020-02-21 16:28:03 -08:00
A.J. Beamon 2431d4d788 Always compute the time for a trace event when it is being logged rather than when it is being created. Usually these are the same, but if they aren't, doing the opposite can lead to out of order trace events. 2020-02-21 13:57:04 -08:00
A.J. Beamon 6810a03283 Add more logging to valley filler and mountain chopper 2020-02-21 10:55:14 -08:00
Alvin Moore 90b4050eca Added required include for stringstream 2020-02-21 09:59:11 -08:00
Alvin Moore d02d84a577 Added required include for std:set which is for some reason only missing within Windows build 2020-02-21 09:36:24 -08:00
Alvin Moore 9042cab7bc Changed ordering of link libraries 2020-02-21 08:56:52 -08:00
Evan Tschannen dc3826e2fd fix: tls throttling would re-insert the failure into the map 2020-02-20 18:17:39 -08:00
Evan Tschannen f04e311a1e Merge commit 'b46d6e25e24993ab5a5f04091fd3235050b7cd09' into feature-boost-ssl
# Conflicts:
#	fdbserver/SimulatedCluster.actor.cpp
#	flow/Net2.actor.cpp
2020-02-20 17:36:38 -08:00
Alex Miller 927cff3317 Report errors on TLS misconfigurations ... or at least try to. 2020-02-20 16:57:29 -08:00
Evan Tschannen d7c841a28a
Merge pull request #2589 from etschannen/feature-proxy-delay
Improve version pipelining on the proxy
2020-02-20 15:23:30 -08:00
Evan Tschannen 8129f74a10
Merge pull request #2698 from etschannen/feature-recruit-delay
The CC waits until no new workers register before starting a bad recruitment
2020-02-20 14:42:37 -08:00
Evan Tschannen 7d54acf4ca removed an unnecessary yield 2020-02-20 14:41:49 -08:00
A.J. Beamon 5586e6f6d8
Merge pull request #2697 from etschannen/feature-correctness-fixes
A variety of correctness fixes
2020-02-20 13:32:18 -08:00
Evan Tschannen 08c318d28a re-added the connect lock in the fdbcli so that the timeout is not spent before a connection has been initiated (because of the handshake lock) 2020-02-20 10:43:34 -08:00
Evan Tschannen 69b5a1fbe3 more priority improvements 2020-02-20 10:11:43 -08:00
Evan Tschannen fd8a58b035 re-added support for the TLS_DISABLED flag 2020-02-19 18:37:47 -08:00
Evan Tschannen 761da5a059 code cleanup 2020-02-19 17:59:45 -08:00
Evan Tschannen fbd45963d8 The cluster controller waits until no new workers register for 1.0 before starting a bad recruitment 2020-02-19 16:48:30 -08:00
Evan Tschannen 9b3254d5f4 A corrupted processId file should be deleted in simulation, as that is the manual operation that would fix the problem in the real world 2020-02-19 15:21:42 -08:00
Alex Miller fe78524bbc
Merge pull request #2678 from sears/networktest_perf
Add some tuning knobs to networktestclient; also, measure latency directly
2020-02-19 14:38:09 -08:00
Russell Sears 956a3efa80 Pull request comments 2020-02-19 10:55:05 -08:00
Alex Miller 88d36af9c7 Fix --tls_password and add better error logging
This refactors all tls settings into a TLSParams object so that we can
set the password before loading any certificates.

It turns out that the FDBLibTLS code did really nice things with error
logging, but I just didn't understand openssl enough before to realize
what pieces I should be copying.
2020-02-19 00:57:05 -08:00
Meng Xu 31a6ec34b7 Merge branch 'master' into mengxu/fast-restore-agent-PR 2020-02-18 16:17:59 -08:00
Alex Miller 9d88356468
Merge pull request #2686 from mpilman/features/avoid-unnecessary-template-instanciations
Removed dead code
2020-02-17 14:46:39 -08:00
Alex Miller 9144c3e8ca
Merge pull request #2087 from atn34/issue-1226
Allow member actors access to private variables
2020-02-17 14:39:31 -08:00
mpilman aac94a766b Removed dead code 2020-02-15 21:56:48 -08:00
A.J. Beamon 649fc6ba94
Merge pull request #2329 from davisp/trace-clock-source-network-option
Add network option for the trace clock source
2020-02-15 10:43:00 -08:00
Paul J. Davis 32e285a761 Add network option for the trace clock source
This option allows clients to select the clock source for trace events
similar to the `--traceclock` command line parameter for `fdbserver`.
Using the `realtime` clock sources makes loading event data into
OpenTracing systems like Jaeger more useful.
2020-02-15 11:30:43 -06:00
Markus Pilman ccf590e193 Merge branch 'master' of github.com:apple/foundationdb into features/boost70 2020-02-14 22:05:51 -08:00
mpilman 579444419a remove call to `get_io_service` 2020-02-14 21:22:14 -08:00
mpilman 3a1e878a9b Upgrade to boost 1.72 2020-02-14 18:10:13 -08:00
Evan Tschannen 693e469003 Changed the handshake lock to a BoundedFlowLock, which will enforce that old handshakes complete before starting to initiate new handshakes 2020-02-14 16:49:52 -08:00
Evan Tschannen 321dded7dd rely on preverified to verify the certificate 2020-02-14 16:45:04 -08:00
Alex Miller 94e7f790d8
Merge pull request #2667 from atn34/atn34/remove-flatbuffers-knob
Remove USE_OBJECT_SERIALIZER knob
2020-02-14 15:44:38 -08:00
Alex Miller 723a70b357 Call X509_verify_cert once and implement time checking by hand 2020-02-13 21:31:36 -08:00
Alex Miller d716c50000 Find OpenSSL or LibreSSL in CMake 2020-02-13 21:31:36 -08:00
Alex Miller 8298fb3cb5 Remove spammy traceevent from testing 2020-02-13 21:31:36 -08:00
Russell Sears 7724c644e5 Add some tuning knobs to networktestclient; also, measure latency directly. 2020-02-13 13:11:54 -08:00
Steve Atherton 0c7c815396 Merge branch 'release-6.2' of github.com:apple/foundationdb into feature-redwood
# Conflicts:
#	tests/CMakeLists.txt
2020-02-12 16:12:57 -08:00
Andrew Noyes 1248d2b8b4 Remove USE_OBJECT_SERIALIZER knob 2020-02-12 10:41:52 -08:00
Steve Atherton 93e3e36d52 Changed RedwoodRecordRef::compare() to include value and updated VersionedBTree to adapt to this change. This fixes an (uncommitted) bug where DeltaTree inserts of a record matching a deleted record except for value would simply unhide the deleted record. For DeltaTree delete/insert sequences to work correctly compare() must only return 0 when the records are fully equivalent. 2020-02-12 01:18:35 -08:00
Meng Xu e76b6d824a FastRestore:Assign priority to actors to prioritize vb work
When we pipeline multiple version batches, we should prevent a later
version batch from blocking the earlier version batch by consuming
CPU resources.

To achive the above, we should assign higher priority to actors
in later phases in a version batch.

Because restore master will not invoke an actor at a later phase unless
the actors at the earlier phases have been finished. This priority assignment
will not cause dead lock.
2020-02-10 20:29:23 -08:00
Evan Tschannen dcbce3593e fixed TLS in simulation 2020-02-10 14:00:21 -08:00
mpilman 5a9d420cb7 Merge remote-tracking branch 'upstream/release-6.2' into release-merges/20200210 2020-02-10 10:02:05 -08:00
A.J. Beamon ff44bd2b33
Merge pull request #2639 from atn34/atn34/include-port-in-address-default
Enable include_port_in_address by default for api version 700
2020-02-10 09:50:59 -08:00
Markus Pilman e71fe44ee3
Merge branch 'master' into features/icc 2020-02-08 21:33:02 -08:00
A.J. Beamon abb75f7eb7 Add logging to indicate the time spent at each priority that exceeds some minimum busyness threshold 2020-02-07 14:34:24 -08:00
A.J. Beamon 6010d835fb Reorganize the interaction between slow task checking and check_yield 2020-02-07 10:35:09 -08:00
Alex Miller 2a2bf945ef Also remove FDBLibTLS from CMake 2020-02-06 21:55:13 -08:00
Alex Miller e390dbd36c Add a non-FDBLibTLS verify peers framework to new TLS impl 2020-02-06 21:06:52 -08:00
Evan Tschannen 38d8d0d675 fixed simulation 2020-02-06 19:29:31 -08:00
Evan Tschannen 69de430057 separate handshaking from connection to improve pipelining 2020-02-06 16:45:54 -08:00
A.J. Beamon df2b0452b4 Step 3 of fixing storage server range reads: change return type of readRange from VectorRef<KeyValueRef> to RangeResultRef. 2020-02-06 13:19:24 -08:00
Evan Tschannen 53d0867a17 limit the number of connections a process can attempt to establish in parallel 2020-02-04 18:15:10 -08:00
Evan Tschannen c9738ab133 do not destroy an ssl connection until async_handshake has returned 2020-02-04 17:54:03 -08:00
Andrew Noyes fcefb4bf6d
Merge branch 'master' into issue-1226 2020-02-04 17:46:36 -08:00
Evan Tschannen 84853dd1fd switched SSL implementation to use boost ssl 2020-02-04 14:56:40 -08:00
Evan Tschannen 8449badb3e
Merge pull request #1868 from dongxinEric/fix/1827/error_instead_of_timeout
Send error back before put the GRV request with PRIORITY_BATCH into t…
2020-02-04 14:32:47 -08:00
mpilman 35d4aef8ee Reverted `crc32` 2020-02-04 10:53:42 -08:00
mpilman 52ca752dd3 Merge remote-tracking branch 'origin/features/icc' into features/icc 2020-02-04 10:29:49 -08:00
mpilman d09e07f1f5 Merge remote-tracking branch 'upstream/master' into features/icc 2020-02-04 10:26:18 -08:00
Evan Tschannen 4524831456
Merge pull request #2518 from vishesh/task/failmon-remove-server
FailureMonitoring: Server processes no longer need to talk to ClusterController
2020-02-03 17:22:50 -08:00
Andrew Noyes 2ce887012c Respect api version for include_port_in_address 2020-02-03 15:25:30 -08:00
Xin Dong 7016f7903b Fixed another build error. Do not use timeReplyIgnoreError since we don not want the logging inside that function and thus that's unnecessary anymore. Change to use ready() which basically ignores the error. 2020-01-31 15:48:29 -08:00
Alex Miller ee6490c9d1
Merge pull request #2314 from mengranwo/memory-engine
New Radix-Tree based Memory Storage Engine
2020-01-30 16:20:13 -08:00
Xin Dong 7216961e46 Do not time the error. 2020-01-30 14:13:56 -08:00
Xin Dong e21426d12a Send error back to the GRV requests with batch priority when the cluster is saturated, instead of blindly enqueue the requests and let the client timeout. 2020-01-30 14:13:56 -08:00
A.J. Beamon cdeb0ee35b Merge branch 'master' into slow-task-and-priority-tracking-improvements 2020-01-30 09:54:50 -08:00
A.J. Beamon 809586ec31 When logging boost errors, include message in addition to the error code 2020-01-30 08:55:41 -08:00
A.J. Beamon 182dac7cd5 Convert the slow task profiler into a run loop profiler that also logs when the run loop is 100% busy for a knob-configurable duration. 2020-01-28 12:09:37 -08:00
Evan Tschannen 9a620f3e6c fixed bug in timer() 2020-01-27 17:43:59 -08:00
Evan Tschannen 231d7830a0 more accurate calculation on the amount of time that proxy should wait before getting a version from the master 2020-01-26 19:47:12 -08:00
Alex Miller 2bc5b2cf8a
Merge pull request #2585 from Ma27/fix-glibc230-build
Fix build with glibc 2.30
2020-01-23 20:21:32 -08:00
Evan Tschannen 76e192d490
Merge pull request #2538 from alexmiller-apple/hashlittle2-to-crc32c
Convert more hashlittle{,2} uses to crc32c_append
2020-01-23 17:54:38 -08:00
Evan Tschannen 6c0b934dda
Merge pull request #2242 from alexmiller-apple/fix-10min-stall-again
Fix the 10min multi-region recovery stall again
2020-01-23 17:53:02 -08:00
Maximilian Bosch e133cb974b
Fix build with glibc 2.30
The `gettid()` function is part of glibc 2.30[1]. I decided to keep the
`gettid` implementation here under a different name to remain compatible
to older glibc versions.

[1] https://sourceware.org/ml/libc-alpha/2019-08/msg00029.html
2020-01-23 09:28:18 +01:00
Jingyu Zhou 17002740bb Add epoch and backup workers to DBCoreState
This enables backup workers to know the end version of the epoch. Additionally,
the master recovery only needs to deal with crashed backup workers by
recruiting new workers to backup the unfinished version range.
2020-01-22 19:38:45 -08:00
Jingyu Zhou a4d6ebe79e Recruit backup worker in newEpoch 2020-01-22 19:37:48 -08:00
Evan Tschannen 78adbea834 Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	documentation/sphinx/source/release-notes.rst
#	flow/Knobs.h
#	versions.target
2020-01-21 21:38:19 -08:00
Evan Tschannen afd3ec13ff added knobs 2020-01-21 18:58:34 -08:00
Alex Miller 1cb311fcb8 Add an ASSERT_WE_THINK that peek cursors don't get timed_out()
This should prevent us from regressing and having multi-region
recoveries hang for 10min again.
2020-01-21 17:07:37 -08:00
Evan Tschannen 7a4b459f07 wait for a tls handshake to complete before returning a connection
wait for multiple tls errors before throttling
2020-01-21 16:45:15 -08:00
Vishesh Yadav daef5f011a Merge remote-tracking branch 'apple/master' into task/failmon-remove-server 2020-01-21 13:20:15 -08:00
Evan Tschannen 3f9d9d8b84 Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	cmake/FlowCommands.cmake
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/StorageServerInterface.h
#	fdbserver/DataDistributionTracker.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/fdbserver.actor.cpp
#	flow/Knobs.h
#	flow/Platform.cpp
#	versions.target
2020-01-16 18:37:47 -08:00
A.J. Beamon f3d2a1990e Revert unintended change 2020-01-15 15:53:48 -08:00
A.J. Beamon 173802f27e Add some tracking for time spent in the run loop to the priority tracker. Add more frequent measurements for priority changes. Fix slow tasks not being measured in some cases. 2020-01-15 15:51:15 -08:00
mengranwo 115ff8bf65 change getKey() interface, pass in uint8_t * only 2020-01-15 13:49:45 -08:00
mengranwo 6836e370a7 revert changes inside KVStoreTest, ready for code review 2020-01-15 13:49:45 -08:00
mengranwo f597aa7e18 WIP : deployable/stable version since Nov 3. Start rebase to master branch 2020-01-15 13:49:45 -08:00
Evan Tschannen e65760eb46
Merge pull request #2536 from etschannen/feature-commit-latency
Improved commit latency in large clusters
2020-01-13 19:12:02 -08:00
Alex Miller 31fbf84ac5 Make FastAlloc use crc32c instead of hashlittle2 2020-01-13 18:23:12 -08:00
Alex Miller da73164eda Move crc32c from fdbrpc to flow
So that we can use it from a piece of flow code without breaking module
boundaries.

Also rename generated-constants to crc32c-generated-constants so that
it's more apparent that they're related files.
2020-01-13 18:19:30 -08:00
Evan Tschannen 0e916fdbed throttle client TLS errors longer than server errors so that when both happen simultaneously the server throttling will be disabled when the client makes its next attempt 2020-01-12 22:12:18 -08:00
Evan Tschannen 1f7eb1f738 throttle outgoing tls connections before establishing a network connection
store serverTLSConnectionThrottler map inside of g_network, so that it works properly with simulation
2020-01-12 16:44:30 -08:00
Evan Tschannen ef5dfb87dc
Merge pull request #2529 from bnamasivayam/tls-throtlling
Establishing TLS connection through the handshake process is expensiv…
2020-01-12 14:56:21 -08:00
Balachandar Namasivayam 741aa523e6 Establishing TLS connection through the handshake process is expensive and the fdbserver process can get easily saturated with doing repeated TLS handshakes with only a few hundreds of clients have bad certificate. Hence throttle the number of handshakes done on the server per client ip if it has a bad certificate. 2020-01-10 16:19:41 -08:00
Evan Tschannen 2e20c12200
Merge pull request #2475 from ajbeamon/priority-busy-fixes
Fix PriorityBusy calculation and add PriorityMaxBusy
2020-01-10 12:47:17 -08:00
Evan Tschannen 176a1b6319
Merge pull request #2515 from ajbeamon/remove-timer-in-slowtask-profiler
Fix slow task profiler crash
2020-01-10 12:41:57 -08:00
Evan Tschannen a5f544818c
Merge pull request #2420 from ajbeamon/trace-clock-source-fix
Revert change to make g_trace_clock thread_local, ...
2020-01-10 12:36:38 -08:00
Alvin Moore 7628d04fb9 Merge branch 'release-6.2' of github.com:apple/foundationdb into release_6.2_merge
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
2020-01-09 07:21:16 -08:00
Vishesh Yadav 598b2eaeb0 fdbrpc: Add warning when peer is unavailable for long time 2020-01-08 13:55:13 -08:00
Evan Tschannen 83ad9caf54 implemented a load balancing algorithm which evens out the number of requests processes by each proxy 2020-01-08 01:59:01 -08:00
A.J. Beamon de5a591b15 Attempt a minor pointless change to fix the build 2020-01-06 15:17:13 -08:00
A.J. Beamon 6cf38790d6 Reorganize declaration of variable and add release note. 2020-01-06 12:27:56 -08:00
A.J. Beamon 4a52864023 Remove call of timer() from the slow task profiling signal handler, as it can lead to crashes if called at the wrong time. 2020-01-06 12:19:45 -08:00
Evan Tschannen 16b5af067c changed trace event name 2020-01-03 16:03:29 -08:00
Evan Tschannen deb032745a fix: do not set logged until then end of the function 2020-01-03 12:45:23 -08:00
Evan Tschannen 1867d30017 added asserts to protect against future actions on a trace event that has been logged 2020-01-03 12:31:06 -08:00
Evan Tschannen 7152469cc3 log the base trace event before the endpoint messages 2020-01-03 12:15:38 -08:00
Evan Tschannen 6e473c3a83 Merge branch 'release-6.2' into feature-addpeer-fix 2020-01-02 17:37:23 -08:00
Evan Tschannen 032797ca5c
Merge pull request #2430 from etschannen/release-6.2
Reduce recovery times caused by saturating the cluster controller
2020-01-02 17:35:59 -08:00
A.J. Beamon 3dd3ac3cfd Merge branch 'release-6.2' into trace-clock-source-fix
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
2020-01-02 15:14:12 -08:00
A.J. Beamon ca01593067 Cap busyness to 1.0 at logging time to cover all cases where it could be measured above. 2020-01-02 15:10:42 -08:00
Evan Tschannen 9e137d3b49 fix: addPeerReference only marks a connection as healthy if it is the first peerReference
added additional logging to long LoadBalance calls, and when the failure monitor state changes for an address
2019-12-19 18:26:29 -08:00
A.J. Beamon 3b28c7f103 Throw the correct error in deleteFile 2019-12-19 14:13:09 -08:00
A.J. Beamon 414be7a0e4
Merge pull request #2479 from AlvinMooreSr/release_6.2_merge
Release 6.2 merge
2019-12-19 09:36:05 -08:00
Evan Tschannen d8c3c2fda4 Improved prioritization of commit path on the proxies 2019-12-18 16:56:35 -08:00
Vishesh Yadav 902029bbec
Merge pull request #2448 from mpilman/issues/2446
Changed failure monitor ping delay to 1 second
2019-12-18 13:26:34 -08:00
Alvin Moore 21390c493a Merge branch 'release-6.2' of github.com:apple/foundationdb into release_6.2_merge
Resolved merge by keeping new test file from master branch: SampleNoSimAttrition.txt adding new constraint from Release branch about existing test file: SimpleExternalTest.txt

# Conflicts:
#	tests/CMakeLists.txt
2019-12-18 09:05:08 -08:00
A.J. Beamon a093021855 Fix priority time calculation. Track max priority busy rather than seconds squared. 2019-12-17 09:14:54 -08:00
Alvin Moore 5080b8293a Added Windows required library for function GetProcessMemoryInfo 2019-12-16 08:09:10 -08:00
Alvin Moore 3bf971ba8b Merge branch 'release-6.2' of github.com:apple/foundationdb into release_6.2_merge
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbserver/storageserver.actor.cpp
2019-12-12 07:13:12 -08:00
mpilman 7a62d3b526 Changed failure monitor ping delay to 1 second 2019-12-11 11:23:24 -08:00
Evan Tschannen 3c30215662 Merge branch 'release-6.2' of github.com:apple/foundationdb into release-6.2 2019-12-09 13:18:07 -08:00
A.J. Beamon 20eacdb434 Add missing include 2019-12-06 15:18:17 -08:00
A.J. Beamon 9866d1ce27 Revert change to make g_trace_clock thread_local, instead checking we are on the correct thread when getting the time. 2019-12-06 10:15:49 -08:00
Andrew Noyes 78b202f3a4 Apply A.J.'s suggestion to randomInt as well 2019-12-05 11:01:41 -08:00
Andrew Noyes cf5cdc4e93
Update flow/DeterministicRandom.cpp
Include equality now that we've adjusted the value by 1.

Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2019-12-05 11:01:03 -08:00
Andrew Noyes b09f0b334b Take A.J.'s suggestion, which fixes A.J.s counterexample 2019-12-05 10:27:35 -08:00
Andrew Noyes 96e71bb109 Fix old Make build 2019-12-04 17:06:49 -08:00
Andrew Noyes 89a093e035 Accept UBSAN's suggestion
/home/anoyes/workspace/foundationdb/flow/DeterministicRandom.cpp:72:29: runtime error: negation of -9223372036854775808 cannot be represented in type 'long int'; cast to an unsigned type to negate this value to itself
2019-12-04 16:45:45 -08:00
Andrew Noyes 4f943be21d Move DeterministicRandom impl to its own translation unit
This will allow me to recompile faster after making changes, and should
(slightly) speed up overall compilation.

I manually verified that the unseed matched for one test before and
after this change, so I probably didn't screw up the refactor
2019-12-04 16:45:32 -08:00
Evan Tschannen 5a6bc2aa71 increase the priority of cluster controller recruitment to prefer recruitment over sending serverDBInfo 2019-12-04 16:28:41 -08:00
Evan Tschannen 5f1ef53f62 increase the priority at which the cluster controller registers workers to avoid having a saturated cluster controller recruit a master without all available workers 2019-12-04 16:17:41 -08:00
Andrew Noyes 46d10dc7dc Fix "null passed as argument declared not null"
Fix several such reports from ubsan

E.g.

/Users/anoyes/workspace/foundationdb/flow/Arena.h:794:16: runtime error: null pointer passed as argument 1, which is declared to never be null
2019-12-03 14:46:53 -08:00
Andrew Noyes 4022ca1381 Add USE_UBSAN cmake option 2019-12-02 16:20:06 -08:00
Andrew Noyes 7f263a2614 Require callers to alignment requirements for aligned_alloc 2019-12-02 15:30:54 -08:00
Evan Tschannen 07331ab5fd
Merge pull request #2362 from etschannen/master
Merge 6.2 into master
2019-12-02 15:04:27 -08:00
Andrew Noyes ff8758b1fd Request alignment that's at least sizeof(void*)
According to https://en.cppreference.com/w/c/memory/aligned_alloc#Notes,
aligned_alloc may return nullptr if it doesn't like the requested
alignment. Let's also detect if nullptr is returned.
2019-12-02 12:51:33 -08:00
Meng Xu a2c84d932d
Merge pull request #2396 from atn34/atn34/fix-invalid-vptr
Fix invalid vptr
2019-11-27 21:16:49 -08:00
Meng Xu 7eaf76bacf
Merge pull request #2389 from atn34/atn34/default-init-flatbuffers
Default initialize absent flatbuffers members
2019-11-27 21:16:27 -08:00
Andrew Noyes 8fc74e3182 Fix UBSAN error
Since QuorumCallback<T> is a non trivial type, we need to construct it
before we interact with it

This change fixes the following UBSAN message
/Users/anoyes/workspace/foundationdb/flow/genericactors.actor.h:930:18: runtime error: member access within address 0x0001243f63d0 which does not point to an object of type 'Callback<Standalone<StringRef> >'
0x0001243f63d0: note: object has invalid vptr
2019-11-27 13:16:48 -08:00
Andrew Noyes 3a9fd29d3c Initialize memory for Optional and ErrorOr
This does _not_ fix any potential uses of uninitialized memory. Without
this change, gcc issues false-positive -Wuninitialized warnings

I'm hoping this does not have a noticeable impact on performance
2019-11-26 11:31:30 -08:00
Andrew Noyes c4e01301b0 Fix a potential UB instance
Writing a value which is not 0 or 1 to the underlying memory of a bool
is undefined behavior. Conformant flatbuffers implementations must
accept bytes that are not 0 or 1 as booleans [1]. (Conformant
implementations are only allowed to write the byte 0 or 1 as a boolean
[1])

So this protects us from undefined behavior if we ever read a
flatbuffers message written by an almost-conformant implementation.

[1]: https://github.com/dvidelabs/flatcc/blob/master/doc/binary-format.md#boolean
2019-11-26 11:18:17 -08:00
Andrew Noyes 17ab2f8e00 Default initialize absent flatbuffers members 2019-11-26 10:58:29 -08:00
Evan Tschannen 3c769fcf60 Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	documentation/sphinx/source/release-notes.rst
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	versions.target
2019-11-22 15:39:19 -08:00
Evan Tschannen ebcb2f79ed Merge branch 'master' of github.com:apple/foundationdb 2019-11-22 15:34:49 -08:00
Meng Xu cb77c7dd47
Merge pull request #2355 from atn34/more-no-discard
Refactor Notified
2019-11-22 12:09:10 -08:00
Evan Tschannen 27cb299d84 simulation can sometimes randomly hang or throw connection_failed, instead of always doing one or the other 2019-11-21 16:24:18 -08:00
Evan Tschannen 2727b91c46 simulation tests network connections failing due to errors instead of just hanging 2019-11-21 12:33:07 -08:00
Evan Tschannen dbfa3dc217
Merge pull request #2200 from negoyal/storage-cache-subfeature1
Storage cache subfeature1
2019-11-20 13:59:06 -08:00
A.J. Beamon b5a450b4c6 Fix capitalization error 2019-11-15 12:41:08 -08:00
A.J. Beamon ed8d3f163c Rename hgVersion to sourceVersion. 2019-11-15 12:26:51 -08:00
Evan Tschannen 8d3ef89540 Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/MutationList.h
#	fdbserver/MasterProxyServer.actor.cpp
#	versions.target
2019-11-14 15:49:56 -08:00
Andrew Noyes b4aa72303f Add [[nodiscard]] for whenAtLeast, and make Notified generic 2019-11-13 13:30:34 -08:00
negoyal a4a0bf18f9 Merging with Master. 2019-11-12 13:01:29 -08:00
Evan Tschannen 1e5677b55a increase the priority of reboot and recruitment requests 2019-11-11 15:17:11 -08:00
Meng Xu e7210fe842 Trace:Resolve review comments and add SevVerbose level 2019-11-05 09:42:29 -08:00
Meng Xu c4d1e6e1a9 Trace:Severity:Include SevNoInfo to mute trace
Define SevFRMutationInfo to trace mutations in restore.
2019-11-04 16:18:40 -08:00
Jingyu Zhou 00b3c8f48a
Revert "Clean up some memory after network thread exits" 2019-11-01 11:05:31 -07:00
Jingyu Zhou 6c28da9093 Clean up some memory after network thread exits 2019-10-29 13:59:55 -07:00
Andrew Noyes b7b5d2ead3 Remove several nonsensical const uses
These seem to be all the ones that clang's -Wignored-qualifiers
complains about
2019-10-26 14:30:34 -07:00
Andrew Noyes e4acd2e318 Disable TLS temporarily for OPEN_FOR_IDE build 2019-10-25 10:42:22 -07:00
Evan Tschannen 3325980c03 Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	documentation/sphinx/source/release-notes.rst
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/OldTLogServer_6_0.actor.cpp
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/WorkerInterface.actor.h
#	fdbserver/worker.actor.cpp
#	versions.target
2019-10-24 17:38:15 -07:00
mpilman 325a8e4213 remove confusing USE_ODIRECT knob 2019-10-24 11:44:03 -07:00
mpilman f23392ec5a Don't use O_DIRECT in EIO by default 2019-10-24 11:39:55 -07:00
mpilman 7ad0e20e48 Added knob to disable O_DIRECT 2019-10-24 11:20:14 -07:00
mpilman f41f19b5f6 Introduced knob to set eio parallelism 2019-10-24 11:20:14 -07:00
Stephen Atherton 0e51a248b4 Merge branch 'release-6.2' of github.com:apple/foundationdb into feature-redwood 2019-10-23 10:12:54 -07:00
Evan Tschannen 688940b685 merge 6.2 into master 2019-10-21 11:43:46 -07:00
mpilman a79757a788 Fix compiler errors on Catalina
Fixes #2263
2019-10-21 11:15:37 -07:00
Evan Tschannen 42b7acf7b7
Merge pull request #2202 from etschannen/feature-share-mutations
Backup and DR would not share mutations if started on different versions of FDB
2019-10-16 20:28:39 -07:00
Evan Tschannen 35ef9b32de fix: if establishing a TLS connection took longer than 10ms, we could spend all our CPU establishing new connections instead of pinging to maintain existing connections, leading to an infinite loop 2019-10-16 17:26:01 -07:00
Andrew Noyes b149aee260 Include hgVersion.h in FLOW_SRCS
This way if we rebuild after reconfiguring, the binaries will pick up
the new hgVersion.h
2019-10-16 02:13:20 -07:00
A.J. Beamon 3ba8fd95b5 Add script to parse output from enabling ALLOC_INSTRUMENTATION_STDOUT 2019-10-08 15:50:47 -07:00
Jingyu Zhou 396b10caca Add memory profiling for FastAlloc when gperftool is used
FastAlloc is the major memory use case in FDB, yet we can't profiling its usage.
This commit replaces FastAlloc memory allocation with malloc so that we may
track its memory usage when gperftool is used.
2019-10-07 19:27:06 -07:00
Evan Tschannen 1b946d588f
Merge pull request #2208 from alexmiller-apple/faster-txstag-recovery
Recover Txs Faster [0/?]: Combine spill-by-value and spill-by-reference into one file/SharedTLog
2019-10-07 11:15:56 -07:00
Alex Miller d38a96ab73 Make LogData aware of the spill type it was created to perform.
The spilling type is now pulled out of the request, and then stored on
LogData for later access, and persisted in the tlog metadata per tlog
generation.

It turns out that serializing types as Unversioned is a bit wonky.
2019-10-03 01:45:10 -07:00
Meng Xu d0147e5e5d Merge branch 'release-6.2' into mengxu/merge-release620-to-master-v3
Resolved Conflicts:
	documentation/sphinx/source/release-notes.rst
	fdbserver/DataDistribution.actor.cpp
	versions.target
2019-10-02 13:22:56 -07:00